(57) Methods of selecting tag nucleic acids and VLSIPS™ arrays, and the arrays made by the methods, are
used to label and track compositions, including cells and viruses, e.g., in libraries of cells or viruses. In addition
to providing a way of tracking compositions in mixtures, the tags facilitate analysis of cell and viral phenotypes.
Description

FIELD OF THE INVENTION 

This invention provides sets of nucleic acid tags, arrays of oligonucleotide probes, nucleic acid-tagged sets of
recombinant cells and other compositions, and methods of selecting oligonucleotide probe arrays. The invention relates
to the selection and interaction of nucleic acids, and nucleic acids immobilized on solid substrates, including related
chemistry, biology, and medical diagnostic uses.

BACKGROUND OF THE INVENTION

Methods of forming large arrays of oligonucleotides and other polymers on a solid substrate are known. Pirrung
et al., U.S. Patent No. 5,143,854 (see also PCT Application No. WO 90/15070), McGall et al., U.S. Patent No. 5,412,087,
Chee et al., SN PCT/US94/12305, and Fodor et al., PCT Publication No. WO 92/10092 describe methods of forming
arrays of oligonucleotides and other polymers using, for example, light-directed synthesis techniques.

In the Fodor et al. publication, methods are described for using computer-controlled systems to direct polymer
array synthesis. Using the Fodor approach, one heterogeneous array of polymers is converted, through simultaneous
coupling at multiple reaction sites, into a different heterogeneous array. See also, Fodor et al. (1991) Science, 251:
767-777; Lipshutz et al. (1995) BioTechniques 19(3): 442-447; Fodor et al. (1993) Nature 364: 555-556; and Medlin
(1995) Environmental Health Perspectives 244-246. The arrays are typically placed on a solid surface with an area
less than 1 inch², although much larger surfaces are optionally used.

Additional methods applicable to polymer synthesis on a substrate are described, e.g., in US Pat. No. 5,384,261,
incorporated herein by reference for all purposes. In the methods disclosed in these applications, reagents are delivered
to the substrate by flowing or spotting polymer synthesis reagents on predefined regions of the solid substrate. In each
instance, certain activated regions of the substrate are physically separated from other regions when the monomer
solutions are delivered to the various reaction sites, e.g., by means of grooves, wells and the like.

Procedures for synthesizing polymer arrays are referred to herein as very large scale immobilized polymer synthesis
(VLSIPS™) procedures. Oligonucleotide VLSIPS™ arrays are useful, for instance, in a variety of procedures
for monitoring test nucleic acids in a sample. In probe arrays with multiple probe sets, many distinct hybridization
interactions can be monitored simultaneously. However, unwanted hybridization between probes, or between probes
and other nucleic acids, can make analysis of multiple hybridizations problematic. This invention solves these and
other problems.

SUMMARY OF THE INVENTION 

With this invention it is now possible to label and detect many individual components present, inter alia, in molecular,
cellular and viral libraries using a limited number of hybridization conditions. Components are labeled with specially
selected nucleic acid tags, and the presence of individual tags is monitored by hybridization to a probe array (typically
a VLSIPS™ array of oligonucleotide probes). Thus, the tag nucleic acids are labels for the individual components, and
the probe array provides a label reader which permits simultaneous detection of a very large number of tag nucleic
acids. This facilitates massively parallel analysis of all of the components in a mixture in a single assay.

For instance, as explained herein, all of the members of a cellular library can be tested for response to an
environmental stimulus using a mixture of all of the members of the cellular library in a single assay. This is accomplished,
e.g., by labeling each member of the cellular library, e.g., by cloning a nucleic acid tag into each cell type in the library,
mixing each cell type in the library in an appropriate solution, and exposing part of the solution to the selected
environmental stimulus. The distribution of nucleic acids in the library before and after the environmental stimulus is
compared by hybridization of the nucleic acids to a VLSIPS™ array, allowing for detection of cells which are specifically
affected by the environmental stimulus.

Accordingly, the present invention provides, inter alia, tag nucleic acids, sets of tag nucleic acids, methods of
selecting tag nucleic acids, libraries of cells, viruses or the like containing tag nucleic acids, arrays of oligonucleotide
probes, arrays of VLSIPS™ probes, methods of selecting arrays of oligonucleotide probes, methods of detecting tag
nucleic acids with VLSIPS™ arrays and other features which will become clear upon further reading.

In one class of embodiments, the invention provides a method of selecting a set of tag nucleic acids designed for
minimal cross hybridization to a VLSIPS™ array. The absence of cross hybridization facilitates analysis of hybridization
patterns to VLSIPS™ arrays, because it reduces ambiguities in the interpretation of hybridization results which arise
due to multiple nucleic acid species binding to a single species of probe on the VLSIPS™ array. Thus, in the selection
methods of the invention, potential tags are excluded from a set of tags where they bind to the same nucleic acid as
selected tags under stringent conditions. The selection methods typically include the steps of selecting a specific thermal
binding stability for the tag acids against complementary probes, and excluding tags which contain self-complementary
regions. Often, the thermal binding stability of the tags is selected by specifying parameters which influence
binding stability, such as the length and base composition of the tag nucleic acids (e.g., by selecting tags with the
same AT to GC ratio of nucleotides). In this regard, tags which form more GC bonds upon binding a
complementary probe require fewer overall bases to have the same binding stability with a complementary probe as
tags which have fewer GC residues. Binding stability is also affected by base stacking interactions, the formation of
secondary structures and the choice of solvent in which a tag is bound to a probe.
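
The two selection steps named above (a fixed base composition as a stand-in for thermal binding stability, and exclusion of self-complementary regions) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the use of an exact G+C count as the stability criterion, and the 4-base stem length are hypothetical choices made for the example, not parameters taken from this specification.

```python
# Watson-Crick complements for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def gc_count(seq):
    """Count G and C residues, a simple proxy for binding stability."""
    return sum(base in "GC" for base in seq)


def has_self_complement(seq, stem=4):
    """True if a stem-length window can pair with a later, non-overlapping
    window of the same tag (a potential hairpin stem).

    The stem length of 4 is an assumed value for illustration only.
    """
    for i in range(len(seq) - stem + 1):
        partner = reverse_complement(seq[i:i + stem])
        # Only look downstream of the window; upstream partners are found
        # when the loop visits the upstream window itself.
        if seq.find(partner, i + stem) != -1:
            return True
    return False


def passes_stability_filters(seq, gc_target, stem=4):
    """Keep tags with the chosen G+C count and no self-complementary stem."""
    return gc_count(seq) == gc_target and not has_self_complement(seq, stem)
```

A fixed G+C count at a fixed length is the crudest way to equalize stability; base stacking and solvent effects, which the text also mentions, are not modeled here.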

The size of the tags can vary substantially, but is typically from about 8-150 nucleotides, more typically between
10 and 100 nucleotides, often between about 15 and 30 nucleotides, generally between about 15 and 25 nucleotides
and, in one preferred embodiment, about 20 nucleotides in length. In a few applications, the tags are substantially
longer than the probes to which they hybridize. The use of longer tags increases the number of tags from which
non-cross hybridizing probes can be selected.

The tag nucleic acids are optionally selected to have constant and variable regions, which facilitates elimination
of secondary structure arising from self-complementarity, and provides structural features for cloning and amplifying
the tags. For instance, PCR binding sites or restriction enzyme sites are optionally incorporated into constant regions
in the tags. In other embodiments, short constant regions are added in coding theory methods to prevent misalignment
of the tags. Constant regions are optionally cleaved from the tag during processing steps, for instance by cleaving the
tag nucleic acids with class II restriction enzymes.

Often it is desirable to eliminate tags which contain runs of 4 nucleotides selected from the group consisting of 4
X residues, 4 Y residues and 4 Z residues, where X is selected from the group consisting of G and C, Y is selected
from the group consisting of G and A, and Z is selected from the group consisting of A and T. The elimination of tags
from a tag set which contain such runs of nucleotides reduces the formation of secondary structure in the selected
tags in the tag set. In some embodiments, certain runs are permitted, while others are excluded. For instance, in one
embodiment, runs of 4 A/T or G/C nucleotides are prohibited.
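
A run filter of this kind is straightforward to express as a pattern match. The sketch below reads the rule as "any 4 consecutive bases drawn entirely from {G, C}, entirely from {G, A}, or entirely from {A, T}"; that reading, and the names `PROHIBITED_RUNS`, `AT_GC_RUNS_ONLY` and `has_prohibited_run`, are assumptions made for illustration.

```python
import re

# Runs of 4 residues drawn from G/C (X), G/A (Y) or A/T (Z).
PROHIBITED_RUNS = re.compile(r"[GC]{4}|[GA]{4}|[AT]{4}")

# Narrower variant for the embodiment that prohibits only A/T and G/C runs.
AT_GC_RUNS_ONLY = re.compile(r"[GC]{4}|[AT]{4}")


def has_prohibited_run(seq, pattern=PROHIBITED_RUNS):
    """True if the tag contains any run matched by the chosen pattern."""
    return pattern.search(seq) is not None


def drop_tags_with_runs(tags, pattern=PROHIBITED_RUNS):
    """Return only the tags that contain no prohibited run."""
    return [tag for tag in tags if not has_prohibited_run(tag, pattern)]
```

Passing `AT_GC_RUNS_ONLY` as the pattern permits purine runs such as GAGA while still rejecting the A/T and G/C runs.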

In many embodiments, tags which differ by fewer than about 80% of the total number of nucleotides which comprise
the tags are excluded. For instance, all selected tags in a selected tag set preferably differ by at least about 4-5
nucleotides. It is also desirable to exclude tags which share substantial regions of sequence identity, because the regions
of identity can cross-hybridize to nucleic acids which have a subsequence complementary to the region of identity. For
instance, where 20-mer tags are identical over regions of 9 or more nucleotides, they are typically excluded.
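
One way to apply both exclusion rules is a simple pairwise screen. The sketch below is a hypothetical greedy pass, not the procedure of this specification: it compares equal-length tags position by position (no shifted alignments), and the thresholds of 5 differences and 9 shared nucleotides are taken from the figures quoted above purely as defaults.

```python
def hamming(a, b):
    """Number of positions at which two equal-length tags differ."""
    if len(a) != len(b):
        raise ValueError("tags must have the same length")
    return sum(x != y for x, y in zip(a, b))


def share_identical_region(a, b, length=9):
    """True if the two tags share an identical stretch of `length` bases."""
    windows = {a[i:i + length] for i in range(len(a) - length + 1)}
    return any(b[j:j + length] in windows for j in range(len(b) - length + 1))


def greedy_select(candidates, min_diff=5, shared_length=9):
    """Keep a candidate only if it is sufficiently different from every
    tag already kept. The result depends on candidate order (greedy)."""
    selected = []
    for tag in candidates:
        compatible = all(
            hamming(tag, kept) >= min_diff
            and not share_identical_region(tag, kept, shared_length)
            for kept in selected
        )
        if compatible:
            selected.append(tag)
    return selected
```

A greedy pass does not guarantee the largest possible tag set; it only guarantees that every pair in the returned set satisfies both rules.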

The tags in the tag sets of the invention typically differ by at least two nucleotides, and preferably by 3-5 nucleotides
for a typical 20-mer. A list of tags which differ by at least two nucleotides can be generated by pairwise comparison of
each tag, or by other methods. For instance, the tag sequences can be aligned for maximal correspondence and tags
with a single mismatch discarded. In one class of embodiments, the number of A+G nucleotides in each of the variable
regions of each of the tags is selected to be even (or, alternatively, odd), providing a "parity base" or "error correcting
base" which provides that each tag have at least two hybridization mismatches between every tag in the tag set, and
any individual complementary nucleic acid probe (other than the probe which is a perfect complement to the tag). Other
methods of ensuring that at least two mismatches exist between every tag in a tag set and any individual hybridization
probe are also appropriate.
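
The parity idea can be checked by brute force on short sequences. The sketch below enumerates every sequence of a given length, keeps those with a fixed G+C count and an even A+G count, and measures the smallest pairwise difference. The function names and the short length are assumptions chosen to keep the enumeration small. Note that the A+G parity on its own does not separate two tags that differ by a single A-for-G swap; in this sketch it is the combination with a constant G+C count that makes every single substitution break one of the two constraints.

```python
from itertools import combinations, product


def gc_count(seq):
    """Number of G and C residues."""
    return sum(base in "GC" for base in seq)


def purine_count(seq):
    """Number of A and G residues (the quantity whose parity is fixed)."""
    return sum(base in "AG" for base in seq)


def hamming(a, b):
    """Number of mismatched positions between equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def parity_filtered_tags(length, gc_target, parity=0):
    """All sequences of `length` with the given G+C count and A+G parity
    (parity=0 for even, parity=1 for odd)."""
    tags = []
    for bases in product("ACGT", repeat=length):
        seq = "".join(bases)
        if gc_count(seq) == gc_target and purine_count(seq) % 2 == parity:
            tags.append(seq)
    return tags


def min_pairwise_distance(tags):
    """Smallest number of mismatches between any two tags in the list."""
    return min(hamming(a, b) for a, b in combinations(tags, 2))
```

For 4-mers with two G/C residues, `min_pairwise_distance(parity_filtered_tags(4, 2))` evaluates to 2, so no two tags in that filtered set are single mismatches of one another. The exhaustive enumeration is only practical for short lengths; for 20-mers the same two constraints would be applied as filters on candidate tags rather than by listing every sequence.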

In general, the selection of the tag nucleic acids facilitates selection of the probe nucleic acids, e.g., on VLSIPS™
arrays used to monitor the tag nucleic acids by hybridization. Specifically, the probes on the array are selected for their
ability to hybridize to variable sequences in the set of tag nucleic acids (the "variable" region of a tag which does not
include a constant region is the entire tag). Thus, all of the rules for selection of tag nucleic acids can be applied to the
selection of probe nucleic acids, for example by performing the tag selection steps and then determining the
complementary set of probe nucleic acids.

In another class of embodiments, the invention provides compositions comprising sets of tag nucleic acids, which
include a plurality of tag nucleic acids. In preferred embodiments, the set of tag nucleic acids comprises from
100-100,000 tags. Typically, a tag set will include between about 500 and 15,000 tags. Usually, the number of tags in
a tag set is between about 5,000 and about 14,000 tags. In one preferred embodiment, a set of tags of the invention
comprises about 6,000-9,000 tags. The tag sequences typically comprise a variable region, where the variable regions
of the tag nucleic acids in the set have the same G+C to A+T ratio, approximately the
same Tm, and the same length, and do not cross-hybridize to a single complementary probe nucleic acid. Most typically,
the tag nucleic acids in the set of tag nucleic acids cannot be aligned with less than two differences between any two
of the tag nucleic acids in the set of tag nucleic acids, and often at least 5 differences exist between any pair of tags
in a tag set. In one embodiment, the tags also comprise a constant region such as a PCR primer binding site for
amplification of the tag.

In one class of embodiments, the invention provides a method of labeling a composition, comprising associating
a tag nucleic acid with the composition, wherein the tag nucleic acid is selected from a group of tag nucleic acids which
do not cross-hybridize and which have a substantially similar Tm. Typically, the tag labels are detected with a VLSIPS™
array which comprises probes complementary to the tags [...] molecular libraries such as [...].

As described herein, preferred [...] using VLSIPS™ arrays [...] can also be labeled using the nucleic acid tags of
the invention. For instance, high denomination currency can be labeled with [...] nucleic acid tags, and counterfeits
detected by monitoring hybridization of a wash of the currency (or, e.g., [...] of attached nucleic acids which encode
tag sequences) with an array [...].

In another class of embodiments, the invention provides methods of pre-selecting experimental probes in an
oligonucleotide probe array, wherein the probes have substantially uniform hybridization properties and do not cross
hybridize to a target tag nucleic acid. In the methods, a ratio of G+C to A+T nucleotides shared by the experimental
probes in the array is selected and all possible 4 nucleotide [...] of the array are determined. All potential probes
from the array which contain prohibited 4 nucleotide [...] are [...] from the experimental [...] are selected [...]
hybridization stringency against a known nucleic acid. [...] are optionally used in VLSIPS™ arrays [...].

In one class of methods of the invention, a plurality of [...] are simultaneously detected in a sample.
In the methods, an array of experimental probes which [...] hybridize to a target under stringent conditions is
used to detect the target nucleic acids. Typically the ratio of G+C bases in each experimental probe is substantially
identical. The probes of the array are arranged into [...] sequence are [...] population of oligonucleotide probes.
For example, many [...] geometric shape. Probe sets are [...] of probe sets on the substrate. [...] VLSIPS™ [...]
set in the array hybridizes to a different target nucleic acid under stringent [...].

[...] variable region. The variable [...] G+C to A+T ratio and the same length) and does not cross-hybridize [...]
in the set of tag nucleic acids found in the different recombinant cells [...] two differences
between the tag nucleic acids. Generally, the recombinant cells are selected [...] of genetically distinct
recombinant cells (eukaryotic, prokaryotic or archaebacterial) or [...]. In preferred
embodiments, the cells are yeast cells. In another class of preferred embodiments [...] origin.

The present invention provides [...] probes in the array are arranged into probe sets at defined locations [...]
hybridization reactions between the oligonucleotide probes and test nucleic acids in a sample. The oligonucleotide
arrays can have virtually any number of [...] variety of
test nucleic acids or nucleic acid tags to be screened against the array [...]. In one group of embodiments,
the array has from 10 up to 100 oligonucleotide sets. In other groups of embodiments, the arrays have between
100 and 10,000 sets. In certain embodiments, the arrays have between 10,000 and 100,000 sets, and in yet other
embodiments the arrays have between 100,000 and 1,000,000 sets. Most preferred embodiments will have between
7,500 and 12,500 sets. For example, in one preferred embodiment, the arrays will comprise about 8,000 sets of
oligonucleotide probes. In preferred embodiments, the array will have a density of more than 100 sets of oligonucleotides
at known locations per cm², or more preferably, more than 1000 sets per cm². In some embodiments, the arrays have
a density of more than 10,000 sets per cm².

The present invention also provides kits embodying the inventive concepts outlined above. For example, kits of
the invention comprise any of the arrays, cells, libraries or tag sets described herein. Also, because the methods of
using the arrays and tags optionally include PCR, LCR and other in vitro amplification techniques for amplifying tag
nucleic acids, the kits of the invention optionally include reagents for practicing in vitro amplification methods such as
taq polymerase, nucleotides, computer software with tag selection programs and the like. The kits also optionally
comprise nucleic acid labeling reagents, instructions, containers and other items that will be apparent to one of skill
upon further reading.
BRIEF DESCRIPTION OF THE DRAWING



Figure 1 is a scanned image of a 1.28 cm by 1.28 cm high-density array hybridized with a fluorescently labeled
control oligonucleotide. The array contains complementary sequences to 4,500 20mer tags selected as described in
Table 1. Control oligonucleotides are synthesized in the corners and in a cross-hair pattern across the array to verify
the uniformity of the synthesis and the hybridization conditions. "DNA TAGS" was spelled out with control
oligonucleotides as well. The dark areas indicate the location of the 4,500 20 base molecular tags. Note that there is no
cross-hybridization of the control oligonucleotide and the molecular tag sequences.

Figure 2 shows a PCR-targeting strategy used to generate tagged deletion strains. (a) The ORF is identified from
the sequence information in the database. Regions immediately flanking the ORF are used to generate the deletion
strain. (b) The selectable marker (kanr) is amplified using a pair of long primers to generate an ORF specific deletion
construct. The up-stream 86mer primer consists of (5' to 3'): 30 bases of yeast homology, an 18 base common tag
priming site, a 20 base molecular tag, and a 22 base sequence that is homologous to one side of the marker. The
down-stream oligonucleotide consists of 50 bases of yeast homology to the other side of the targeted ORF and 16
bases that are homologous to the other side of the marker. The dashed lines representing the long oligonucleotides
illustrate that the primers are unpurified and are missing sequence on the 5' end. (c) A second round of PCR with
20mers homologous to the ends of the initial PCR product was used to "flush" the ragged ends generated by unpurified
oligonucleotide in the first round. (d) The resulting marker flanked by yeast ORF homology on either side is transformed
directly into a haploid yeast strain, and homologous recombination results in the replacement of the targeted ORF with
the marker, 20mer tag, and tag priming site.

Figure 3 shows oligonucleotides used to generate the ADE1 tagged deletion strain. Similar sets of oligonucleotides
were synthesized for the other ten auxotrophic ORFs.

Figure 4 shows transformation results and tag information for eleven auxotrophic ORFs. Eight colonies from each
transformation were analyzed by replica plating and PCR; the resulting targeting efficiency is shown for each of the
ORFs. The sequence and x,y coordinates are shown for the molecular tags that were used to uniquely label the different
deletion strains.

Figure 5 shows the tag amplification strategy described in Example 1. (a) A deletion pool was generated by
combining equal numbers of the eleven tagged deletion strains described in Figure 3. Genomic DNA isolated from a
representative aliquot of the pool was used as template for a tag amplification reaction. (b) Tags were amplified using a
single pair of primers that are homologous to the common priming sites which flank each tag. One of the common
primers is labeled with 5' fluorescein and included in a 10-fold excess over the unlabeled primer. (c) The asymmetric
nature of the PCR generates a population of single-stranded fluorescently labeled 60mer tag amplicons that are directly
hybridized to the high-density 20mer array, which is then washed and scanned. (d) An actual scanned image of the array
shows the (predicted) hybridization pattern for the tags with virtually no cross-hybridization on the rest of the chip. A
closeup view of the left hand corner shows the location of the tags for each of the different deletion strains.

Figure 6 shows the analysis of a deletion pool containing 11 tagged auxotrophic deletion strains. A deletion pool
was generated by combining equal numbers of cells from each of the 11 deletion strains described in Figure 3.
Representative aliquots were grown in (A) complete media (SDC), (B) media missing adenine (SDC-ADE), or (C) media
missing tryptophan (SDC-TRP). Cells were harvested at the indicated time points and genomic DNA was isolated.
Tags were amplified from the genomic DNA and labeled amplicons were directly hybridized to the high-density array
for 30 minutes, washed, and scanned. A blowup of the upper left hand corner for each of the scans is shown.
DEFINITIONS



Unless defined otherwise, technical and scientific terms [...] understood by one of ordinary skill in the art [...]
Advanced Organic Chemistry Reactions, Mechanisms and Structure 4th ed. J. Wiley [...]
a general guide to many of the terms used in this invention. [...] equivalent to those described herein [...]

[...] which are the differentiated offspring of cells which [...]

[...] prokaryotes which are cellular organisms which [...]

A "nucleoside" is a pentose glycoside in which the [...] of a phosphate
group the compound becomes a nucleotide. The [...] are β-glycoside derivatives of D-ribose
or D-2-deoxyribose. Nucleotides are phosphate [...] acidic due to the hydroxy groups on
the phosphate. The polymerized nucleotides [...] environment. The nucleosides of DNA and [...]

[...] unless otherwise limited, encompasses known analogs of natural nucleotides [...] naturally occurring nucleotides.

An "oligonucleotide" is a nucleic acid polymer [...] nucleotides or nucleotide analogues. An
oligonucleotide can be derived from natural [...] is of any size.

An "oligonucleotide array" is a spatially defined pattern of oligonucleotide [...]. A [...]
array of oligonucleotides" is an array of spatially defined oligonucleotides [...] which [...] before
being constructed (i.e., the arrangement of polymers on solid [...] and not random).

A "nucleic acid reagent" utilized in standard automated oligonucleotide [...] phosphate on the 3' hydroxyl of the
ribose. Thus [...] nucleotide reagents, [...] phosphoramidites, phosphora[...]

[...] a molecule while a chemical reaction is carried out [...] used herein can be any of those groups described in
Greene et al., Protective Groups in Organic Chemistry, 2nd Ed., John Wiley & Sons, New York, NY, [...]
groups for a particular synthesis is governed by the overall [...]. For example, in "light-directed" synthesis,
discussed herein, the [...] photolabile protecting groups such as NVOC, [...], incorporated herein [...] groups such as [...]

[...] embodiments, at least one surface of the substrate is planar. In [...] embodiments [...] wells or the like [...]
separate regions of the substrate to delineate [...]. Examples of solid substrates include slides, [...]. A solid
support is "functionalized" to permit the coupling of monomers used in polymer synthesis. [...] support is optionally
coupled to a nucleoside monomer through a covalent linkage to [...]. Solid support materials typically are [...]
during polymer synthesis, providing a substratum [...]. Solid support materials include, but
are not limited to, glass, [...] polystyrene/latex, and carboxyl modified
Teflon. The solid substrates are biological, nonbiological, [...] as particles, strands, precipitates, gels, sheets,
tubing, [...] slices, films, plates, slides, etc. depending upon the particular [...] planar but optionally takes on
[...]. [...] the solid substrate optionally contains raised or depressed regions on which synthesis takes place. In some
embodiments, the solid substrate is chosen to provide appropriate light-absorbing [...] polymerized Langmuir Blodgett
film, functionalized glass, Si, Ge, GaAs, [...] such as (poly)tetrafluoroethylene, (poly)vinylidendifluoride, polystyrene,
polycarbonate, or combinations thereof. Other [...]
regions for different polymers with, for example, wells, raised regions, etched trenches, or the like. In some
embodiments, the substrate itself contains wells, trenches, flowthrough regions, etc. which form all or part of the
regions upon [...].

The term "recombinant" when used with reference to a cell or virus indicates that the cell or virus encodes a DNA
or RNA whose origin is exogenous to the cell or virus. Thus, for example, recombinant cells optionally express nucleic
acids (e.g., RNA) not found within the native (non-recombinant) form of the cell.

"Stringent" hybridization conditions are sequence dependent and will be different with different environmental
parameters (salt concentrations, presence of organics etc.). Generally, stringent conditions are selected to be about
5°C lower than the thermal melting point (Tm) for the specific nucleic acid sequence at a defined ionic strength
and pH. Preferably, stringent conditions are about 5°C to 10°C lower than the thermal melting point for a specific
nucleic acid bound to a complementary nucleic acid. The Tm is the temperature [...]
the solvent, base composition of the duplex, number and type of base pairs, position of [...]
to facilitate detection of specific hybridization.

Stringent temperature conditions will [...], more usually in excess [...] parameters is more important than the
measure of any single parameter. See [...] Biol. 31: 349-370 and Wetmur (1991) Critical Reviews in Biochemistry and
Molecular Biology 26(3/4): 227-259.

The term "identical" in the context of two nucleic acid sequences refers to the residues in the two sequences which
are the same when aligned for maximum correspondence. Optimal alignment of sequences for comparison can be
conducted, e.g., by the local homology algorithm of Smith and Waterman Adv. Appl. Math. 2: 482 (1981), by the [...]
conditions.

DETAILED DESCRIPTION OF THE INVENTION 
markers (mutants, polymorphisms, etc.), and to track the effect of environmental changes on the viability of tagged cells.

For instance, with the completion of the sequencing of S. cerevisiae, thousands of open reading frames (ORFs)
have been identified. One strategy for identifying the function of the identified ORFs is to create deletion mutants for
each ORF, followed by analysis of the resulting deletion mutants under a wide variety of selective conditions. Typically,
the goal of such an analysis is to identify a phenotype that reveals the function of the missing ORF. If the analysis were
to be carried out for each deletion mutant in a separate experiment, the required time and cost for monitoring the effect
of altering an environmental parameter on each deletion mutant would be prohibitive. For instance, to identify ORFs
which are required for synthesis of an amino acid, all of the thousands of ORF deletion mutants would be individually
tested for the ability of the mutant to grow in media lacking the amino acid. Even if the analysis were carried out in a
parallel fashion using, e.g., 96-well plates, the effort required to plate, organize, label and track each clone would be
prohibitive. The present invention provides a much more cost-effective approach to screening cells.

In the methods of the invention, all of the thousands of deletion mutants described above can be tested in parallel
in a single experiment. The deletion mutants are each tagged with a tag nucleic acid, and the deletion mutants are
then pooled. The pooled, tagged deletion mutants are then simultaneously tested for their response to an environmental
stimulus (e.g., growth in medium lacking an amino acid). The deletion cell-specific tags are then read using a probe
array such as a VLSIPS™ array. Thus, by analogy, the deletion cell-specific nucleic acid tags act as bar code labels
for the cells and the VLSIPS™ array acts as a bar code reader.
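
Reading the "bar codes" before and after a stimulus amounts to comparing two sets of per-tag hybridization signals. The sketch below is a hypothetical illustration of that comparison: the function name, the log2 threshold, the intensity floor, and the assumption that both scans have already been normalized to a common scale are all choices made for the example and are not taken from this specification.

```python
import math


def depleted_tags(before, after, min_log2_drop=1.0, floor=1.0):
    """Flag tags whose signal fell by at least `min_log2_drop` log2 units.

    `before` and `after` map tag identifiers to hybridization intensities
    from two scans assumed to be on the same scale. Intensities below
    `floor` are raised to `floor` so that the ratio is always defined.
    """
    flagged = {}
    for tag, signal_before in before.items():
        signal_after = after.get(tag, 0.0)
        log2_ratio = math.log2(
            max(signal_after, floor) / max(signal_before, floor)
        )
        if log2_ratio <= -min_log2_drop:
            flagged[tag] = log2_ratio
    return flagged


# Made-up intensities for illustration only (not experimental data).
before_scan = {"tag_A": 1000.0, "tag_B": 1000.0}
after_scan = {"tag_A": 950.0, "tag_B": 100.0}
flagged = depleted_tags(before_scan, after_scan)  # only "tag_B" is flagged
```

In a pooled growth experiment of the kind described above, a flagged tag would point to the deletion strain that failed to grow under the selective condition.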

While the example above specifically discusses labeling yeast cells, one of skill will appreciate that essentially any
cell type can be labeled with the nucleic acid tags of the invention, including prokaryotes, eukaryotes, and
archaebacteria. Also, essentially any virus can be similarly labeled, as can cellular organelles with nucleic acids (mitochondria,
chloroplasts, etc.). In fact, labeling by tag nucleic acids and detection by probe arrays is not in any way limited to
biological materials. One of skill will recognize that many other compositions can also be labeled by nucleic acid tags
and detected with probe arrays. Essentially anything which benefits from the attachment of a label can be labeled and
detected by the tags, arrays, and methods of the invention. For instance, high denomination currency, original works
of art, valuable stamps, significant legal documents such as wills, deeds of property, and contracts can all be labeled
with nucleic acid tags and the tags read using the probe arrays of the invention. Methods of attaching and cleaving
nucleic acids to and from many substrates are well known in the art, including glass, polymers, paper, ceramics and
the like, and these techniques are applicable to the nucleic acid tags of the invention.

One of skill will also appreciate that while many of the examples herein describe the use of a single tag nucleic
acid to label a cell, multiple tags can also be used to label any cell, e.g., by cloning multiple nucleic acid tags into the
cell. Similarly, multiple nucleic acid tags can be used to label a substance such as those described above. Indeed,
multiple labels are typically preferred where the object of the nucleic acid tags is to detect forgery. For instance, the
nucleic acids of the invention can be used to label a high denomination currency bill with hundreds, or even thousands,
of distinct tags, such that visualization of the hybridization pattern of the tags on a VLSIPS™ array provides verification
that the currency bill is genuine.

In certain embodiments, multiple probes bind to unique regions on a single tag. In these embodiments, the probes
are typically relatively large, e.g., about 50 nucleotides or longer. Probes are selected such that each probe binds to
a single region on a single tag. The use of tags which bind multiple probes increases the informational content of
hybridization reactions by making it possible to monitor multiple hybridization events simultaneously.

One of skill will also appreciate that it is not necessary to hybridize a tag directly to a probe array to achieve
essentially the same effect. For instance, tag nucleic acids are optionally (and preferably) amplified, e.g., using PCR
or LCR or other known amplification techniques, and the amplification products ("amplicons") hybridized to the array.
For instance, a nucleic acid tag optionally includes or is in proximity to PCR primer binding sites which permit
amplification of the tag nucleic acid, or a subsequence thereof, using standard PCR techniques. Thus, cells, or other tagged
items can be detected even if the tag nucleic acids are present in very small quantities. One of skill will appreciate that
a single molecule of a nucleic acid tag can easily be detected after amplification, e.g., by PCR. The complexity reduction
from amplifying a selected mixture of tags (i.e., there are relatively few amplicon nucleic acid species as compared to
a pool of genomic DNA) facilitates analysis of the mixture of tags.

In one preferred embodiment, tags are selected such that each selected tag has a complementary selected tag.
For example, if a tag is cloned into an organism, the tag can be amplified using LCR, PCR or other amplification
methods. The amplified tag is often double-stranded. In preferred embodiments, tag sets which include complementary
sets of tags have corresponding probes for each complementary tag. Both strands of a double-stranded tag amplification
product are separately monitored by the probe array. Hybridization of each of the strands of the double-stranded
tag provides an independent readout for the presence or absence of the tag nucleic acid in a sample.
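
By way of illustration only, the complementary tag that pairs with a selected tag can be computed as sketched below. The function names `complement_base` and `reverse_complement` are hypothetical and are not taken from the programs of the Examples; tags are assumed to be written 5' to 3' using the letters A, C, G and T.

```c
#include <stddef.h>
#include <string.h>

/* Return the Watson-Crick complement of one base, or 'N' if unrecognized. */
char complement_base(char b)
{
    switch (b) {
    case 'A': return 'T';
    case 'T': return 'A';
    case 'C': return 'G';
    case 'G': return 'C';
    default:  return 'N';
    }
}

/* Write the reverse complement of tag (5'->3') into out (5'->3').
   out must have room for strlen(tag) + 1 characters. */
void reverse_complement(const char *tag, char *out)
{
    size_t n = strlen(tag);
    for (size_t i = 0; i < n; ++i)
        out[i] = complement_base(tag[n - 1 - i]);
    out[n] = '\0';
}
```

A tag set that contains both a tag and the output of `reverse_complement` for that tag would allow each strand of a double-stranded amplification product to be read by its own probe.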

Selection of Tag Nucleic Acids

This invention provides ways of selecting nucleic acid tag sets useful for labeling cells and other compositions as
described above. The tag sets provided by the selection methods have uniform hybridization characteristics (i.e., similar
thermal binding stability to complementary nucleic acids), making the tag sets suitable for detection by VLSIPS™ and
other probe arrays. Because the hybridization characteristics of the tags are uniform, all of the tags in the set are
typically detectable using a single set of hybridization and wash conditions. As described in the Examples below, various
selection methods of the invention were used to generate lists of about 10,000 suitable tag sequences from all possible
20-mer sequences (about 1,200,000,000,000). The synthesis of a single array with probes complementary to the tags
(i.e., for the detection of the tags) was carried out to make a VLSIPS™ array.

Desirable nucleic acid tag sets have several properties. These include that the hybridization of the tags
to their complementary probe (i.e., in the VLSIPS™ array) is strong and uniform; that individual tags hybridize only to
their complementary probes, and do not cross-hybridize with probes complementary to other sequence
tags; and that if there are constant regions associated with the tags (e.g., cloning sites or PCR primer binding sites),
the constant regions do not hybridize to a corresponding probe set. If the selected tag set has the described properties,
any mixture of tags can be hybridized to a corresponding array [...] of binding of any tag [...]

Tag sets with the properties described above are obtained by following some or all of the selection steps outlined below
for selection of tag sequence characteristics.

(1) Determine all possible nucleic acid tags of a selected length with selected hybridization properties. Although
the examples below provide ways of selecting tags of a single length for illustrative purposes, one
of skill will appreciate that the tags can have different lengths, provided they have the same (or similar)
melting temperatures against perfectly complementary probes. One of skill will also appreciate that a subset of all
possible tags can be used, depending on the application. For example, where tags are used to detect an organism,
20-mers which either occur in the organism's genome [...] can be used as a
starting point for a pool of potential tags. [...] the genome of S. cerevisiae is available. In
certain embodiments, [...] can be used as tags, obviating the
[...] the genome are determined from the genomic [...]
in hybridization assays.

The selection of the length of the nucleic acid tag is dependent on the desired hybridization and discrimination
properties of the probe array for detecting the tag. In general, the longer the tag, the higher the stringency of the
hybridization and washes of the hybridized nucleic acids on the array. However, longer tags are not as easily discriminated
on the array, because a single mismatch on a long nucleic acid duplex has less of a destabilizing effect on
hybridization than a single mismatch on a short nucleic acid duplex. It is expected that one of skill is thoroughly familiar
with the theory and practice of nucleic acid hybridization and probe selection. In addition to the patents and
literature cited supra in regards to the synthesis of VLSIPS™ arrays, Gait, ed. Oligonucleotide Synthesis: A Practical
Approach, IRL Press, Oxford (1984); W.H.A. Kuijpers Nucleic Acids Research 18(17), 5197 (1994); K.L. Dueholm J.
Org. Chem. 59, 5767-5773 (1994); and Tijssen (1993)
Laboratory Techniques in Biochemistry and Molecular Biology: Hybridization With Nucleic Acid Probes, e.g., part I, chapter
2, "overview of principles of hybridization and the strategy of nucleic acid probe assays", Elsevier, New York, provide a
basic guide to nucleic acid hybridization. oreferably between about 1 0 and 30 nude- 

hybridize to the tag set and any constant tag region ^ 8 *£ tno selectec tags as described below). 

Jnding site, e. 9 . for PGR -P«^ »» cannot hybrid* to any 

typ^Hy being used where a PGR or other ^^^^^S^c. mismatch ,a.l tags differ by 
(3) The tags are selected so that no tag hybridizes to a probe with only one sequence mismatch (all tags differ by
at least two nucleotides). Optionally, tags can be selected which have at least 2 mismatches, 3 mismatches, 4 mismatches,
5 mismatches or more to a probe which is not perfectly complementary to the tag, depending on the application.
Typically, all tag sequences are selected to hybridize only to a perfectly complementary probe, and the nearest mismatch
hybridization possibility has at least two hybridization mismatches. Thus, the tag sequences typically differ by
at least two nucleotides when aligned for maximal correspondence. Preferably the tags differ by about 5 nucleotides
when aligned for maximal correspondence.

(4) The tags are often selected so that they do not have identical runs of nucleotides of a specified length. For instance,
where the tags are 20-mers, the tags are preferably selected so that no two tags have runs of about 9 or more
nucleotides in common. One of skill will appreciate that the length of prohibited identity varies depending on the selected
length of the tag. It was empirically determined that cross-hybridization occurs in tag sets when 20-mer tags have more
than about 8 contiguous nucleotides in common.

have subsequences of 4 or more nucleotides which are W'f""^ and any associaled ccr.stant se- 

^ sr. rrrr.:: zzsz* * «. «» »° — »*» - « - - — - - ,Bai ^ 

do not self-hybridize or form hairpin structures.

(8) Where the tags are of a single length, all tags are selected so that they have roughly the same, and preferably
exactly the same, overall base composition (i.e., the same A+T to G+C ratio). Where the tags are of
differing lengths, the A+T to G+C ratio is determined by selecting a thermal melting temperature for the tags, and
selecting an A+T to G+C ratio and probe length for each tag which has the selected thermal melting temperature.
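
Selection steps (3) and (4) above lend themselves to a simple pairwise test. The following is a minimal sketch, not the program used in the Examples: the names `mismatches`, `share_kmer` and `pair_is_acceptable` are hypothetical, and the mismatch count is taken only over the ungapped, full-length alignment of two tags of equal length, which is a simplification of alignment "for maximal correspondence".

```c
#include <stddef.h>
#include <string.h>

/* Number of positions at which two equal-length tags differ. */
int mismatches(const char *a, const char *b)
{
    int d = 0;
    for (size_t i = 0; a[i] != '\0' && b[i] != '\0'; ++i)
        if (a[i] != b[i]) ++d;
    return d;
}

/* Return 1 if a and b share any run of k contiguous nucleotides. */
int share_kmer(const char *a, const char *b, size_t k)
{
    size_t la = strlen(a), lb = strlen(b);
    if (k == 0 || la < k || lb < k) return 0;
    for (size_t i = 0; i + k <= la; ++i)
        for (size_t j = 0; j + k <= lb; ++j)
            if (strncmp(a + i, b + j, k) == 0) return 1;
    return 0;
}

/* Accept a pair only if the tags differ at min_mismatch or more
   positions and have no run of k nucleotides in common. */
int pair_is_acceptable(const char *a, const char *b,
                       int min_mismatch, size_t k)
{
    return mismatches(a, b) >= min_mismatch && !share_kmer(a, b, k);
}
```

For 20-mer tags, a call such as `pair_is_acceptable(t1, t2, 2, 9)` would correspond to requiring at least two differences and no common 9-mer.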
One of skill will recognize that there are a variety of possible methods of performing the above selection steps. Most
typically, selection steps are performed using simple computer programs to perform the selection in each of the steps
outlined above; however, all of the steps are optionally performed manually. The following strategies are provided for
exemplary purposes; one of skill will recognize that a variety of similar strategies can be used to achieve similar results.

(a) [...] within or between pairs [...] excluded when any [...] which overlap in sequence, and [...]
(hairpin formation) were prohibited.

(b) To assure uniformity of hybridization strength, runs of 4-mers comprised only of 4 As, 4 Ts or 4 G or C residues
were prohibited. Excluding runs of T/A and G/A is also desirable.
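
Strategy (b) and the base composition requirement of step (8) can be sketched as two small filters. This is an illustrative sketch only; the function names `has_prohibited_run` and `gc_count` are hypothetical, and the run length of 4 follows the text above.

```c
#include <stddef.h>

/* Return 1 if the tag contains AAAA, TTTT, or four consecutive
   residues drawn only from G and C. */
int has_prohibited_run(const char *tag)
{
    int run_a = 0, run_t = 0, run_gc = 0;
    for (size_t i = 0; tag[i] != '\0'; ++i) {
        char b = tag[i];
        run_a  = (b == 'A') ? run_a + 1 : 0;
        run_t  = (b == 'T') ? run_t + 1 : 0;
        run_gc = (b == 'G' || b == 'C') ? run_gc + 1 : 0;
        if (run_a >= 4 || run_t >= 4 || run_gc >= 4) return 1;
    }
    return 0;
}

/* Count G and C residues, so that tags of a single length can be
   held to the same A+T to G+C ratio. */
int gc_count(const char *tag)
{
    int n = 0;
    for (size_t i = 0; tag[i] != '\0'; ++i)
        if (tag[i] == 'G' || tag[i] == 'C') ++n;
    return n;
}
```

A candidate tag would be kept only when `has_prohibited_run` returns 0 and `gc_count` equals the value chosen for the whole set.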

Further selection is optionally performed to refine aspects ° ™ aadedlor alignme m purposes, 

select tags which are less like* to cross-hybrid^ 

the tagscan be lengthened^ andaddit.ona ^'JJ^^'^S^ h elected. For exan^.e, a first 
by the method above is performed, and a subset of tags ^ * 4 se , „ the second ag does 

tag from the set generated above ^elected^ and a « ^^ e ^^ ^^.ol.^ 
notcross-hybridizewithmef^^^^ 

Thus, each tag from the group selected by the methods out ineo comparison of one tag 

and selected or discarded based upon '° m P ar ^ sLiar to the steps outlined above. 

,o every other possible tag in a pool of tags ,s referred to as ' fJJ used for the sequence alignment above, 

cross-hybridization can be determined in a dynamic P7«^™ , ^ U ^ tBbll J^ caused by oositiona. 
Refinement of the above 

effects of mismatches in the probe.tag duplex. The .to »' num ™ ^ h , he tions and the lypes o! the mis- 

poteh.a. because the amount o, desta generally less destabilizing than 

.^oS^^ 

(A) All tags are the same length, N, and have similar base composition. Certain runs of bases and potential hairpin
structures are prohibited.

(B) [...] number of probes). The method is essentially an alphabetical tree search with the addition of an array to keep
track of which n-mers have been used in previous tags. Each time the addition of a base to the growing
tag creates an n-mer already used in a previous tag, the method moves on to the next value of the base.

(C) In this step, pairs of tags are compared to each other using a hybridization energy rule.
For each pair of tags, the energy of hybridization between one tag and the complement of the other is calculated; if this
energy exceeds a certain threshold, one of the tags is removed from the list. Tags are removed until there are
no pairs remaining above the threshold. In one embodiment, the energy rule requires
a short match of at least 5 bases to initiate comparison.
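
The alphabetical tree search of step (B) can be sketched as follows. This is a hedged illustration under stated assumptions rather than the tags.ccp program itself: the tag length is reduced to 6 and the prohibited shared n-mer length to 3 so that the search is small, and all identifiers (`select_tags`, `grow`, `used`, and so on) are hypothetical. After a tag is accepted, the sketch unwinds to the first position at which an n-mer is completed, because every longer prefix of the accepted tag now contains a used n-mer.

```c
#include <string.h>

#define TAG_LEN   6    /* illustrative length; the text uses 20-mers */
#define NMER      3    /* no two tags may share an n-mer of this length */
#define NUM_NMERS 64   /* 4^NMER */
#define MAX_TAGS  64

static const char base_letters[] = "ACGT";
static unsigned char used[NUM_NMERS];   /* n-mers present in accepted tags */
char tags[MAX_TAGS][TAG_LEN + 1];
int tag_count;

/* Code (0..NUM_NMERS-1) of the n-mer ending at position end of seq. */
static int nmer_code(const int *seq, int end)
{
    int code = 0;
    for (int p = end - NMER + 1; p <= end; ++p)
        code = code * 4 + seq[p];
    return code;
}

/* Extend the growing tag at position pos, trying bases in alphabetical
   order. Returns 1 when a tag was accepted and the caller's prefix has
   therefore become invalid. */
static int grow(int *seq, int pos)
{
    if (pos == TAG_LEN) {
        if (tag_count < MAX_TAGS) {
            for (int p = 0; p < TAG_LEN; ++p)
                tags[tag_count][p] = base_letters[seq[p]];
            tags[tag_count][TAG_LEN] = '\0';
            ++tag_count;
            for (int e = NMER - 1; e < TAG_LEN; ++e)
                used[nmer_code(seq, e)] = 1;
        }
        return 1;
    }
    for (int b = 0; b < 4; ++b) {
        seq[pos] = b;
        /* prune: this base creates an n-mer already used in a previous tag */
        if (pos >= NMER - 1 && used[nmer_code(seq, pos)]) continue;
        /* after an acceptance, any prefix of NMER or more bases is stale */
        if (grow(seq, pos + 1) && pos >= NMER) return 1;
    }
    return 0;
}

/* Run the search from scratch and return the number of tags found. */
int select_tags(void)
{
    int seq[TAG_LEN];
    memset(used, 0, sizeof used);
    tag_count = 0;
    grow(seq, 0);
    return tag_count;
}
```

Because candidates are visited in alphabetical order, the resulting set depends on that order; a different ordering of the bases would generally give a different set.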

More sorted energy rules can ™^ n " 

Mo/ecu/ar B/o,ogy 26(3/4), 227-259, and B«*«£* ^l^t dUd. on the order o. eva.uation ol the tags^ 
The set of tags chosen using P» , ^ | ^2S^lh?L list contains more tags beginning witn A than with 
For instance, if the tags are evaluated .n alphabet^ ^ ^ ^ such (ags can a , so be used involv;ng De bru.,n 

T A more sopnisucaiea 

approach to generat.ng tne «'8"" »» - exam te a Debfui) n sequence 

incorporating all n-rr.ers could be ^^^^^Tisi into account the other goais for tags outfined 
which do not have an nmer in common. P^edure js n»* ^ ^ 2Q m9fs ^ , m . 

'^yalterna^^^ 

be computed be.o-e discarding any P^^^^iS^^na^ (not including those to tags wh,ch have 
threshold can be discarded. The remaining tag wrthttw mew _ neaf ^ atches rema , ning . 

a,ready been discarded) can be discarded. ^S^^S^rmM the tags lack a constant region. Tags 

For example, in one preferred embodiment of the above whjcn: 
are selected by selecting all possible n-mers (e.g., 20-mers), and eliminating 

(- — -eced number of subsepuences, typically from 5 to 15, in common 

with any other tag. 

of about 50.000 potential tags where the tags are 20-mers^ ^ comparison a , irs , , ag 

A pair-wise selection strategy is then performed 5^^™/^,,^ of the first tag. If the first tag binds to 
is compared to eve^ other tag in the tag set for ^f^^^^SL tag in the po.en.al se, it is kept. If 
a target with a hyorid,zaton threshold "^^^J^Sag with a hybridization energy above the 
another tag in the potential tag set bnd. to the com ^ ^ femajning in the pool 0 f potential 
selected thresho:c. the first tag is discarded. This p « «s ' s JP 63 ^ 4 (1 / gs895 .ccp) which performs the above 
^Zr^SS: EKsM -,s P rro,So t's. li 0 n 9 e preferred embodiment. ,000 tags 

(bind to the same nucleic acid with a similar energy of hybr d {Q anQtner wflh „ energy 
simifcr energy ot nybridiza.ion when a complemenBry nuc leic acid to o g ^ ^ ^ 

exceeding a spewed threshold value, e.g ^^J^Star to the-energy o. hybridization of the perfect 
i, i, binds the sane probe with an energy of % Q( Ier , he energy ol a perfectly 
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. in PvamniP P below taqs were selected by eliminating 

compm. . cons-an, pcdion M . «*» P°<"»> ™ J™ * ££££ ohosen .0 be 3XACTO.CC. 

„mp,i» no *o„ man , C Tn. 0 ™, ^ «£^ s such , ha , i, is no. M H- 

region of hybridizing bases with the exception of a ™**^™£^™ e L primal hybridization occurs. In 
even these sequences are no. ad,acent to the vanao e JJ^^jJJ ^ ^ s J e total G+C content, 
order to meet goals (1) and (6), the tags are selected to have the same total G+C content.

To prevent cross-hybridization between tags, a set of tags which could
not be aligned with less than two errors was chosen, wherein an error is either a mismatch hybridization or an overhanging
nucleotide. This was done by constraining the bases at the ends of the tags to be the same.
In particular, the bases were selected to start at the 5' end with [...] and to end at the 3' end with either
an A or T residue, followed by a G residue. This prevents tags from matching probes with a single overhanging
nucleotide and no other errors [...] of a G-T mismatch. This arrangement
also prevents tags from matching with a single deletion error, because a deletion would cause the probe and
the tag to be misaligned at one end, causing a mismatch. The constraint can be varied
in many ways to yield equivalent results, e.g., by selecting bases to yield C-T or C-A mismatches.

To prevent single mismatch errors in the tag-probe hybridization, the tags are selected
so that the number of As plus the number of Gs in the tag is even (as noted above, the next to last base
from the 5' end is either a T or an A). This base works in a manner analogous to a parity bit in coding theory by requiring
that at least two differences exist between any two tags in the tag set. Because the G+C content of all of the
selected tags is the same (see, above), changes in the variable region have to involve the
substitution of G and C residues, or T and A residues; a change of less than two bases therefore leads to an
odd number of G+A residues. Thus, at least two bases differ between any two tags in the tag set, satisfying (3) above.
Similarly, the strategy could be varied, e.g., by selecting a different base as the parity base which assigns
whether the A+G content of the tag is even, or by requiring an even number of T+G residues.
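
The parity argument can be checked exhaustively for short tags. The sketch below is illustrative only and uses hypothetical names (`ag_parity_is_even`, `gc_total`, `differences`, `min_differences`); it enumerates every tag of a given length having a fixed G+C count and an even A+G count, and reports the smallest number of positions at which any two such tags differ. It assumes lengths of at most 6 so that the enumeration stays small.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LEN  6
#define MAX_POOL 4096   /* 4^MAX_LEN */

/* Return 1 if the number of A plus G residues in tag is even. */
int ag_parity_is_even(const char *tag)
{
    int n = 0;
    for (size_t i = 0; tag[i] != '\0'; ++i)
        if (tag[i] == 'A' || tag[i] == 'G') ++n;
    return (n % 2) == 0;
}

/* Number of G plus C residues in tag. */
int gc_total(const char *tag)
{
    int n = 0;
    for (size_t i = 0; tag[i] != '\0'; ++i)
        if (tag[i] == 'G' || tag[i] == 'C') ++n;
    return n;
}

/* Number of positions at which two equal-length tags differ. */
int differences(const char *a, const char *b)
{
    int d = 0;
    for (size_t i = 0; a[i] != '\0' && b[i] != '\0'; ++i)
        if (a[i] != b[i]) ++d;
    return d;
}

/* Smallest number of differences between any two distinct tags of
   length len with gc G+C residues and even A+G parity. Returns
   len + 1 if fewer than two such tags exist. */
int min_differences(int len, int gc)
{
    static char pool[MAX_POOL][MAX_LEN + 1];
    static const char bases[] = "ACGT";
    int count = 0, best = len + 1;
    long total = 1;

    assert(len >= 1 && len <= MAX_LEN);
    for (int i = 0; i < len; ++i) total *= 4;

    for (long code = 0; code < total; ++code) {
        char tag[MAX_LEN + 1];
        long c = code;
        for (int p = len - 1; p >= 0; --p) { tag[p] = bases[c % 4]; c /= 4; }
        tag[len] = '\0';
        if (gc_total(tag) == gc && ag_parity_is_even(tag))
            strcpy(pool[count++], tag);
    }
    for (int i = 0; i < count; ++i)
        for (int j = i + 1; j < count; ++j) {
            int d = differences(pool[i], pool[j]);
            if (d < best) best = d;
        }
    return best;
}
```

A single substitution that preserves the G+C content exchanges G with C or A with T, and either exchange changes the A+G count by one; this is why no two tags in the enumerated set differ at exactly one position.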

A computer program, tags.ccp, written in the standard "C" programming language, was used to perform each of the selection steps.
For completeness, the program is provided as Example 3 below. One of skill will recognize that a variety of similar
programs can be written, or the selection steps outlined above can be performed manually, to achieve essentially similar results. Tags.ccp uses
a pruned tree search to find all tags meeting the selection criteria, rather than testing every sequence of a selected
length for the desired sequence characteristics. While this provides an elegant selection program with few processing
steps, one of skill will recognize that a program could instead test every potential tag for a desired sequence.

Tags.ccp selects tag sets depending upon a variety of parameters, including the constant sequence, the variable sequence
[...] variable regions. The selection problem is similar
to the problem of designing error-correcting codes in coding theory, although deletions
do not have a correlate in coding theory. This is addressed above by having constant
regions within the probe sequences. The strategy is generalized by changing the location of the parity bit, the required
parity, or the locations of the constant regions, [...]
more differences between pairs of tags. See, Blahut (1983) Theory and Practice of Error Control Codes, Addison-Wesley
Publishing Company, Menlo Park, CA.

One of skill will also recognize that pairwise comparison methods can be used in conjunction with any other selection
method. For instance, the tags generated according to methods such as those implemented by tags.ccp can be
further selected using any of the pairwise comparison methods described herein.



Synthesis of Oligonucleotide Arrays

Oligonucleotide arrays are selected to have oligonucleotides complementary to the tag nucleic acids described
above. Such arrays are made, e.g., using very large scale immobilized polymer synthesis (VLSIPS™) technology.
Pirrung et al., U.S. Patent No. 5,143,854 (see also PCT Application No. WO 90/15070), McGall et
al., U.S. Patent No. 5,412,087, Chee et al. SN PCT/US94/12305, and Fodor et al., PCT Publication No. WO 92/10092
describe methods of forming vast arrays of oligonucleotides and other polymers using, for example, light-directed synthesis techniques.
See also, Fodor et al. (1991) Science 251: 767-777; Lipshutz et al. (1995) BioTechniques 19(3): 442-447; Fodor et al.
(1993) Nature 364: 555-556; and Medlin (1995) Environmental Health Perspectives 244-246.

As described above, diverse methods of making oligonucleotide arrays are known; accordingly no attempt is made
to describe or catalogue all known methods. For exemplary purposes, light directed VLSIPS™ methods are briefly
described below. One of skill will understand that alternate methods of creating oligonucleotide arrays, such as spotting
and/or flowing reagents over defined regions of a solid substrate, are
also known and applicable to the present invention (see, e.g., U.S. Patent No. 5,384,261, incorporated herein by reference
for all purposes). In the methods disclosed in these applications, reagents are typically delivered to the substrate by
flowing or spotting polymer synthesis reagents on predefined regions of the solid substrate.

Light directed VLSIPS™ methods are found, e.g., in U.S. Patent No. 5,143,854 and No. 5,412,087. The light
directed methods discussed in the '854 patent typically proceed by activating predefined regions of a substrate or solid
support and then contacting the substrate with a preselected monomer solution. The predefined regions are activated with
a light source, typically shown through a photolithographic mask. Other regions of the substrate remain inactive
because they are blocked by the mask from illumination. Thus, a light pattern defines which regions of the substrate
react with a given monomer. By repeatedly activating different sets of predefined regions and contacting different
monomer solutions with the substrate, a diverse array of polymers is produced on the substrate. Other steps,
such as washing unreacted monomer solution from the substrate, are optionally used as necessary.

su=n as washing unreacted monomer ^^J^^ hav ing photolab.le. protecting groups leg 

The surface of a sole support .. typ.caily rnodrtwd ' J reactive groups (ftfl .. typica.ty hydroxy! 

KVOC or MeNPoc) and illuminated ^^^^J^Sot'i synthesis, a *0- P hosphoramidite (or other 
c-3-.es) in the illuminated reg.ons. For insrance, dunnc ofcgonu y 5 .. hydroxy , wit h a photolabile group) is 

nucleic acid synthesis reagent) activated «^^*^;^ lh i pr i l o U .^ Following capping 

.„ »p..t.d unt. the desi.ed « "1 made. For instance, slanda.d Souh.m o, northern 

,„ addition to VLSIPS™ a™, othe, *° ™ s suc „ as pap„ s . nitroceMoss. n.on 

*-* VLS,PS " ** ' 

" Tin ^. ^^X- * 

As described above, several methods for the synthesis of oligonucleotide arrays are known. In preferred embodiments,
the oligonucleotides are synthesized directly on a solid surface as described above. However, in certain embodiments,
it is useful to synthesize the oligonucleotides and then couple the oligonucleotides to the solid substrate
to form the desired array. Similarly, nucleic acids in general (e.g., tag nucleic acids) can be synthesized on a solid
substrate and then cleaved from the substrate, or they can be synthesized in solution (using chemical or enzymatic
procedures), or they can be naturally occurring (i.e., present in a biological sample).

Molecular cloning and expression techniques for making biological oligonucleotides and nucleic
acids are known in the art. A wide variety of cloning and in vitro amplification methods suitable for the
construction of nucleic acids are well-known to persons of skill. Examples of these techniques and instructions sufficient to
direct persons of skill through many cloning exercises for the expression and purification of biological nucleic acids
(DNA and RNA) are found in Berger and Kimmel, Guide to Molecular Cloning Techniques, Methods in Enzymology
volume 152, Academic Press, Inc., San Diego, CA (Berger); Sambrook et al. (1989) Molecular Cloning - A Laboratory
Manual (2nd ed.) Vol. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor Press, NY (Sambrook); and Current
Protocols in Molecular Biology, F.M. Ausubel et al., eds., Current Protocols, a joint venture between Greene Publishing
Associates, Inc. and John Wiley & Sons, Inc. (Ausubel). Nucleic acids such as tag nucleic acids
can be cloned into cells (thereby creating recombinant tagged cells) using standard cloning protocols such as those
described in Berger, Sambrook and Ausubel.

Examples of techniques sufficient to direct persons of skill through in vitro methods of nucleic acid synthesis and
amplification of tags and probes in solution, including enzymatic methods






such as the polymerase chain reaction (PCR), the ligase chain reaction (LCR), Qβ-replicase amplification (QBR),
nucleic acid sequence based amplification (NASBA), strand displacement amplification (SDA), the cycling probe
amplification reaction (CPR), branched DNA (bDNA) and
other DNA and RNA polymerase mediated techniques, are known. Examples of these and related techniques are found
in Berger, Sambrook, and Ausubel, as well as Mullis et al. (1987) U.S. Patent No. 4,683,202; PCR Protocols A Guide
to Methods and Applications (Innis et al. eds) Academic Press Inc., San Diego, CA (1990) (Innis); Arnheim & Levinson
(October 1, 1990) C&EN 36-47; WO 94/11383; Vooijs et al. (1993) Am J. Hum. Genet. 52: 586-597; The Journal Of
NIH Research (1991) 3, 81-94; Kwoh et al. (1989) Proc. Natl. Acad. Sci. USA 86, 1173; Guatelli et al. (1990) Proc.
Natl. Acad. Sci. USA 87, 1874; Lomell et al. (1989) J. Clin. Chem. 35, 1826; Landegren et al. (1988) Science 241,
1077-1080; Van Brunt (1990) Biotechnology 8, 291-294; Wu and Wallace (1989) Gene 4, 560; Barringer et al. (1990)
Gene 89, 117; and Sooknanan and Malek (1995) Biotechnology 13: 563-564.

amplified using PCR. Oligonucleotide synthesis is optionally performed 

« «L as Tag » ueleic acids «!»-. appropriate. « used ,s .ag seou.nee. !=• «*»->8 «*» ■ 



Labels 



The ,e,-n -label- ,.le.s .» . co.posi.ion de,ec,ab,e »Y £ 

include radionucleotides. enzymes, substrates, cofactors '" h *^- fl "°^" ™ tjbodie8 polyC | 0na | antibodies, 
magnet*: particles, and the like. Labeling agents optionally '^^'^^S^tSLc acids o-oceeds 
proteins, or other polymers a ^^^^ markers. Southed, dotting, 

by any known method, including immunoblott.ng, tracing p raa ™ . . m0 | 9CU | e based upon 

northern blottihg, southwestern blotting, northwestern blott.ng, or o.he ^methods ^ ch ^ 

s, Z e. charge or affinity. The property. Such 

ol the invention. The detectable moiety can be any materia, naving a ww k > y 
detectable labels have been weiWevetoped ,n the field o^ 

in such methods can be applied to the present .nvent.on. Thus a label any orr^osm on a y \> 

photochemical, biochemical, immunochemical, electrical, ^^,JJ^SS; rad Lbels (e. 
vention mclude fluorescent dyes (e.g.. fluorescein f^ h ana ^ ere others. 

,o . polymer. The ligand Ihen » "^^^^S Zescen, compel, o. a «.m«™c.n, 
or covalemly bound 10 a signal system, such as a oeteclaoie s^v™ anlHioano, lor wample. 

compound. A number d ligands •«< anti-lgands car be used. «». •»»»■"■" '?^*MH any nu.enie o, 
bio* ihyroxine. and «** K can b. used in ™<™™J^^^l^£ aL,* ,0 signal 
amig.nic compound can 0. used in muan ««« » J"™ ^ "ereslas labels w : primarily 

general.geompounds. ? , by conjug., * ;»^™^ ,° ZSESSZZL* V~-»- 
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conjugated gold often appears pink, while various nnj-jgated beads appear the color of the bead. 



Substrates 



As mentioned above, depending upon the assay, the tag nucleic acids, or probes complementary to tag nucleic
acids, can be bound to a solid surface. Many methods for immobilizing nucleic acids to a variety of solid surfaces are
known in the art.

sucstrate through nonspec.fic bonding. employed as the matenal for 

methods for noncovalently binding an assay component can be used. 



EXAMPLES 



results. 



Example 1: Parallel Analysis of Deletion Strains of S. cerevisiae

" P Toovercome this problem. ihdr,idua. ORF de.e:c,s were tagged with a distinguishing "^^J^ 
specific tags were read by hybridizatkxn to a high density array of oligonuc.eot.de probes compns.ng probe sets com 

P, TSec"a?,?ggin 9 strategy invo* s a four-step approach for genera,^ tagged deletion strains tha, can be 
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pooled and analyzed in parallel through selective 6 (Baudia Ozier-Kalogeropoulos el at. 

using the computer program tags.ccp (See, below and Table 1). 





                    Selection criteria                    Properties of          Number of 20mers accepted
                                                          selected 20mers

All Possible        None                                                         1.2 x 10^12
20mers

Primary Filter      Similar Tm (+/- 7 °C)                 Good hybridization     51,081
                    No hairpins > 4 bp
                    No single nucleotide runs > 4 bases
                    Similar base composition              No extreme homology
                    (4-6 of each base)
                    No common 9mers                       Unique

Secondary Filter    Low stringency pair-wise analysis                            9,105 -> (4,500 selected at random)
                                                                                 2,643
                                                                                 853
                                                                                 170
                    Highest stringency comparisons        Very unique            42
                                                                                 1



a 1 98 cm X 1 28 cm array comprising probes complementary to the tag sequences was produced by standard 
m^^ii^^Z resumng highly array of probes provides probe sets «^>"«°" 
imaging using a scanning confocal microscope permitted quantitation of the hybridization
signals for each of the 4,500 sets of 20mers on the array (Figure 1). Hybridization experiments with
[...] 20mer oligonucleotides showed that the arrays are sensitive, quantitative, and highly specific.
As part of a feasibility study, tagged deletion strains were generated for eleven characterized auxotrophic yeast
genes (ADE1, ADE2, ADE3, ADE4, ADE5, ARO2, ARO7, TRP2, TRP3, TRP4, and TRP5) using the strategy described
X^T^l^^^-gan^ the deletion strains are described in Figure 3 and translormat.on 

"'TheTahTwere'Sd 4 and grown in complete media and different drop-out media, Genomic DNA extracted 
from the pooled strains served as a template for an asymmetric tag amplification using a pair of primers homologous to common
regions flanking each tag (Figure 4). Deletion of specific strains from the pool was quantitatively measured by hybridizing
the amplified tags to the high-density arrays (Figure 6A-C).

Example 2: A Method for Selecting Tags From a Pool of Tags

Tags (or probes complementary to the tags) are selected by eliminating tags which bind to the same nucleic acid with a
similar energy of hybridization. Tags bind complementary nucleic acids with a similar energy of hybridization when a
complementary nucleic acid to one tag binds to another tag with an energy exceeding a specified threshold value. The
calculated energy is based upon, e.g., the stacking energy of various base pairs, and the energy cost for a loop in the
chain, and/or upon assigned values for hybridization of base pairs, or on other specified hybridization parameters.
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properties such as assigned stacking values tor tagprobe nyo . ^ , 0 be sjmilar 

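incorporates three distinct ideas in selecting tags. The ideas are:

1) a stacking and loop cost model for the calculation of hybridization energy;

2) a recursive, heavily pruned algorithm, or alternatively a dynamic
programming algorithm; and

3) a hash table to quickly find perfect match segments.

A stacking and loop cost model for the calculation of hybridization energy
The calculated energy is based upon stacking energies and a loop cost, the loop cost being given by
a matrix of the loop size on each strand: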





          0     1     2     3

    0     0     5    10    10

    1     5     0     5    10

    2    10     5     5    10

    3    10    10    10    10



For example, if the following match occurs,

target: A G G T A C G
tag: T C C A T G C

there is a loop size of 1 on the first strand and 0 on the second strand; examination of the table reveals a loop penalty
of 5.
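
The loop cost matrix lends itself to a direct table lookup. The following minimal sketch is for illustration; the names `loop_cost`, `loop_penalty` and `MAX_LOOP` are hypothetical, and treating loops larger than 3 as not allowed (returning -1) is an assumption, since the matrix above only covers loop sizes 0 to 3.

```c
#define MAX_LOOP 3

/* Loop cost indexed by loop size on each strand, as tabulated above. */
static const int loop_cost[MAX_LOOP + 1][MAX_LOOP + 1] = {
    { 0,  5, 10, 10},
    { 5,  0,  5, 10},
    {10,  5,  5, 10},
    {10, 10, 10, 10}
};

/* Penalty for a loop of the given sizes on the two strands, or -1 if
   either size falls outside the tabulated range. */
int loop_penalty(int first_strand, int second_strand)
{
    if (first_strand < 0 || first_strand > MAX_LOOP ||
        second_strand < 0 || second_strand > MAX_LOOP)
        return -1;
    return loop_cost[first_strand][second_strand];
}
```

Because the matrix is symmetric, it does not matter which strand is taken as the row index.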



[Figure: alignment of a target and a tag, with the perfect match region marked]



Both energy calculation algorithms first find a perfect match region,
then calculate the energy of the perfect match sequence, then find the best energy of the fragments before
the perfect match region, and after the perfect match region, and sum the three
energies. Since this model does not have an orientation dependence, the same algorithm is used for both the before
match energy, and the after match energy, by reversing the order of the before match fragments.

A recursive, heavily pruned algorithm

The recursive algorithm tries all match trees with loop sizes that have a low enough cost that they can
be paid for with the maximum possible energy for the remaining matches. Code for the algorithm is as follows:

register short i,j*» 

float tempEnergy, bestEnergy = 0; 
linkSet *en, "returnValue = NULL; 

SS3££X* basesLeftlntag. maxLegalLoop + IV. '^V ■*•» *» 
max 1 aguoop nuuv . . ■ j oop si M along tog . 

for( i = 0; i < maxTagLoop, i + K ' <_» '-calculated the minimum number of base pairs 
// bother with loop is an array that has precalcuiaiea 

needed to pay for ;^ 0 ^ n(basesLeflIn T ag - i, basesLeftlnTarget - j) > bomerWithLoopHlljl; j + + M 
„ j is loop sire along target ^ ^ + . + ^ & + j + !])){ // this tests that the tag matches the 

en = calculateEnergyAndLinksFromPointOnC pp + i + 1. tp+j + 1. »B. 
target, basesLeftlnTag - i-1. ^^^J'^ ^ and the stacking energy pays for the loop, and « 
better than our previous best.... ^ ^ = = ^ ^ (tefflpEnergy = 
(stackingEnergyt*^^ 

retumValue^energy = bestEnergy - tempEnergy; 
if ( i = = 0 && j = = 0){ I' self and successor as link 
firstLink( returnValue, 2, pp. tp ); 
\ e lse{ // add successor as link 

firstLink< returnValue, 1. pp+i + L tp+j+1); 
addNewLink( returnValue, 1, pp. «P ); 

Jelse if( (i= =0) && (j— «)){ " ^^"8 base >*" is adjaCeDt W 
° ld ° ne j, if ,he sucking energy pays for the loop, and ,s better than our 

previous best.... if( (tempEnergy - en- > energy + 

^g^ergyttargettmlllurgetlrp+j + ID - !t2S = ""^ 

makeFirstLinkOneLonger( en ); 

if( returnValue ! = NULL ) farfree( returnValue ); 

returnValue = en; 
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\else // new energy to small, don't want it, just free it. 
if( en ! = NULL ) farfree( en ); 

} e lse{ // energy pays for the loop, and is better than our 

previous best.... if( (tempEnergy = en- > energy + 

sUckingEnergyfurgetltplllUrgetttp.j + HI - M^™^ - <~***™ 

if( retumValue ! = NULL ) farfree< retumValue ); 

returnValue = en; 
addNewLink( retumValue, l.pp, tp); 
}else // new energy to small, don'l want ,t, just free it. 

if( en ! = NULL ) farfree( en ); 



} 



> 



.} 



} 



return retumValue; 

> 

A dynamic programming algorithm 
The dynamic programming algorithm starts off by making a matrix of permitted or legal connections between
the two fragments.



[Figure: the two fragments to be aligned, with the perfect match region marked]



srr-j rrr;rj=: * — — -» «- 

belore the mismatch segment. 







[Matrix of permitted connections between the two fragments: cells holding a 1 mark positions where the bases can
pair, and all other cells hold 0]



The values in those cells are replaced by the sum of the stacking energy and the loop cost.
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(continued) 
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0 


0 
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0 
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0 


0 


0 


0 
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1 



is the best match. 
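
A dynamic programming calculation of this general kind can be sketched as follows. This is a minimal sketch under stated assumptions, not the program of the Example: every legal base pair is given the same assumed energy `PAIR_ENERGY` (the actual stacking values are not reproduced here), the loop costs are those of the matrix given earlier, and the target is assumed to be written in the orientation in which it lies against the tag, so that `tag[i]` is compared with `target[j]`. All identifiers are hypothetical.

```c
#include <string.h>

#define MAX_LOOP    3
#define MAX_FRAG    32
#define PAIR_ENERGY 10   /* assumed uniform value, not from the text */

static const int dp_loop_cost[MAX_LOOP + 1][MAX_LOOP + 1] = {
    { 0,  5, 10, 10},
    { 5,  0,  5, 10},
    {10,  5,  5, 10},
    {10, 10, 10, 10}
};

/* Return 1 if bases a and b can form a Watson-Crick pair. */
static int can_pair(char a, char b)
{
    return (a == 'A' && b == 'T') || (a == 'T' && b == 'A') ||
           (a == 'C' && b == 'G') || (a == 'G' && b == 'C');
}

/* Best total energy of any chain of base pairs between the two
   fragments, or -1 if a fragment is longer than MAX_FRAG. */
int best_match_energy(const char *tag, const char *target)
{
    int cell[MAX_FRAG][MAX_FRAG];
    int n = (int)strlen(tag), m = (int)strlen(target), best = 0;

    if (n > MAX_FRAG || m > MAX_FRAG) return -1;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j) {
            cell[i][j] = 0;
            if (!can_pair(tag[i], target[j])) continue;  /* not a legal connection */
            int prev = 0;   /* best energy of a chain ending just before (i, j) */
            for (int a = 0; a <= MAX_LOOP && i - 1 - a >= 0; ++a)
                for (int b = 0; b <= MAX_LOOP && j - 1 - b >= 0; ++b) {
                    int v = cell[i - 1 - a][j - 1 - b] - dp_loop_cost[a][b];
                    if (v > prev) prev = v;
                }
            cell[i][j] = PAIR_ENERGY + prev;
            if (cell[i][j] > best) best = cell[i][j];
        }
    return best;
}
```

With these assumed values, a perfectly paired 4-mer scores 40, while the same pairing interrupted by one unpaired base on the tag scores 35, the difference being the loop cost for sizes 1 and 0.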
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A hash table to quickly find perfect match segments 
[Figure: a hash table, 4 to the n records long, is built from the list of tags, with one record for every n-mer in
the probe set. Each arrow points to the next instance of the same n-mer; if there are no more, it points to "NULL".]
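
A table of this kind can be sketched with an array of chain heads and a pool of records. The sketch below is illustrative only: the n-mer length of 5, the record layout, and the names `head`, `records`, `build_table`, `nmer_index` and `count_instances` are assumptions, and an index of -1 stands in for the "NULL" pointer of the figure.

```c
#include <string.h>

#define NMER_LEN    5
#define TABLE_SIZE  1024   /* 4^NMER_LEN records */
#define MAX_RECORDS 4096

typedef struct {
    int tag;     /* index of the tag containing this n-mer */
    int offset;  /* position of the n-mer within that tag */
    int next;    /* next record with the same n-mer, or -1 for "NULL" */
} Record;

int head[TABLE_SIZE];
Record records[MAX_RECORDS];
int record_count;

static int base_code(char b)
{
    switch (b) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Table index of the n-mer starting at s, or -1 if s holds fewer than
   NMER_LEN valid bases. */
int nmer_index(const char *s)
{
    int code = 0;
    for (int i = 0; i < NMER_LEN; ++i) {
        int c = base_code(s[i]);
        if (c < 0) return -1;
        code = code * 4 + c;
    }
    return code;
}

/* Enter every n-mer of every tag into the table. */
void build_table(const char *tags[], int ntags)
{
    for (int i = 0; i < TABLE_SIZE; ++i) head[i] = -1;
    record_count = 0;
    for (int t = 0; t < ntags; ++t) {
        int len = (int)strlen(tags[t]);
        for (int off = 0; off + NMER_LEN <= len; ++off) {
            int idx = nmer_index(tags[t] + off);
            if (idx < 0 || record_count == MAX_RECORDS) continue;
            records[record_count].tag = t;
            records[record_count].offset = off;
            records[record_count].next = head[idx];
            head[idx] = record_count++;
        }
    }
}

/* Number of places in the tag list at which the given n-mer occurs. */
int count_instances(const char *nmer)
{
    int idx = nmer_index(nmer), n = 0;
    if (idx < 0) return 0;
    for (int r = head[idx]; r != -1; r = records[r].next) ++n;
    return n;
}
```

Walking the chain for an n-mer yields every tag and offset at which a perfect match segment could begin, which is what allows the energy calculation to be started only where a short perfect match exists.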
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,o ,he comment o, any other ,ag with a stacking energy which exceeds a specified threshold. 

Example 3: "tags.ccp"

The computer program tags.ccp, written in "C" and referred to above, is provided below:

#include <stdio.h>

#include <stdlib.h>

#include <math.h>

#include <alloc.h>

char out[] = "TAGS.OUT";

char hout[] = "HIST.OUT";

char label[] = "ACGT";

#define BASES 4 //number of nucleotides

//the following numbers include all the nonperiodic bases at the
//end of the primer (in this case one G) in addition to the tag

#define PRETAG 0 //bases of primer preceding tag

#define LENGTH 20 //length of tag plus pretag

#define GCLIM 11 //max total of G's and C's allowed
#define ATLIM 11 //max total of A's and T's allowed
#define AALIM 7 //max total of A's allowed

#define CCLIM 6 //max total of C's allowed
#define GGLIM 6 //max total of G's allowed
#define TTLIM 7 //max total of T's allowed






#define NUMERS 256 //number of fourmers

#define LMAT 10 //length of long matches prohibited

#define LNUM 1048576L //number of longmers

#define MXNUM ( 32768 / sizeof(int))

short numVectors;

int far ** lngVectors;

int test_base(int current_base);

int remove_base_data(int current_base);

int complement(int fourmer);

double seq_to_int(void);

£ ^SSSSS^SES]; allowed bases each pos^n 

int fourmers[NUMERS]; //identifies prohibited 4mers

int sequence[LENGTH]; //base at specified position

int basecnt[BASES]; //number of occurrences of each base to left of
 //current_position

int spacings[LENGTH]; //position past (or =) to which an occurrence of
 //a complementary fourmer to the one at this
 //position is prohibited

FILE *outfile,*houtfile;
main()



^ • ' . i i k cbs- //counters 

int i,J,k,cd5, //temp storage of longraer 

I" 8 S'lENGTH] = {2.0.0 ( 0,1.0,0,0,2,0,2,2,2,3,2h //sample tag 

long xpas; //number of passes thru loop over 4

long maxtag; //max number of different tags

int current_base; //current sequence position within backtracking
int tag_count; //count of acceptable tag sequences
„ ^T. //flag for whether tag was acceptable 

m //flag indicating completion of all poss.ble tags 

55 double cursor //integer representation of current tag sequence 

double prev_seq; //integer representation of previously tested tag

int histogram[LENGTH]; //histogram of mismatches to cseq
int mismatches;

numVectors = (LNUM/MXNUM);

lngVectors = (int far * *) farcalloc(numVectors, sizeof(int far *));

if(lngVectors==NULL){
printf("\nerror allocating lngVectors!");

exit(1);

}

for(i=0; i<numVectors; ++i) {

lngVectors[i] = (int far *) farcalloc(MXNUM, sizeof(int));

if(lngVectors[i] == NULL) {

printf("\nerror allocating vectors: i = %d\n",i);

exit(1);

}

}

    //open output file
    outfile = fopen(out,"wt");
    if(NULL == outfile) {
        printf("\nerror opening output file %s\n",out);
        exit(1);
    }
    //open houtput file
    houtfile = fopen(hout,"wt");
    if(NULL == houtfile) {
        printf("\nerror opening output file %s\n",hout);
        fclose(outfile);exit(1);
    }

    //initialize histogram
    for(i=0; i<LENGTH; ++i) histogram[i] = 0;
    //initialize pattern array
    for(i=0; i<LENGTH; ++i) for(k=0; k<BASES; ++k) pattern[i][k] = 1;
    //initialize 4mer array
    for(i=0; i<NUMERS; ++i) fourmers[i] = 1000;
    //initialize 9mer array
    for(i=0; i<LMAT; ++i) lngVectors[i/MXNUM][i%MXNUM] = 0;
    //Runs marked
    fourmers[0] = fourmers[85] = fourmers[170] = fourmers[255] = -1;
    //initialize sequence
    for(i=0; i<LENGTH; ++i) sequence[i] = 0;
    //initialize spacings to disregard incomplete words at beginning
    for(i=0; i<3; ++i) spacings[i] = 1000;
    //initialize spacings to require 6mer palindrome
    for(i=3; i<LENGTH; ++i) spacings[i] = i + 2;
    //initialize base sequence, current position, base count, tag count etc.
    //start at "largest" sequence we know is wrong
    sequence[0] = -1;
    current_base = 0;
    basecnt[0] = basecnt[1] = basecnt[2] = basecnt[3] = 0;
    tag_count = 0;
    prev_seq = -1;

    //initialize fourseq for fourmers up to current_base
    //don't need to since it is at zero and it will be reset
    //initialize fourmers for fourmers up to current_base
    //don't need to do anything because only one is at zero and
    // it will be removed shortly anyway
    //calculate maximum number of different tags
    for(maxtag = 1,i=0; i<LENGTH-PRETAG-1; ++i) {
        maxtag *= BASES;
    }
    printf("\nmaxtag = %ld\n",maxtag);
    //THIS IS A DEBUG STATEMENT!! REMOVE LATER
    maxtag = (long) 10000000000;
    pass = 0;

    //until all probes are exhausted
    for(j=0, xpas=0,exflag=0; exflag != 1 && xpas<maxtag; ++j) {
        /* if(j == 4) {
            ++xpas;
            j=0;
        }
        */
        //DEBUG
        //fprintf(outfile,"\nNOW: ");
        //print_sequence();
        //backtrack to last incrementable base
        for(; sequence[current_base] == BASES-1 || (pass == 1 && current_base > 7);
                --current_base) {
            //remove current base data
            remove_base_data(current_base);
            //set base to zero
            sequence[current_base] = 0;
        }
        if(current_base < PRETAG) {
            exflag = 1;
            continue;
        }
        //remove current base data
        remove_base_data(current_base);
        //increment current_base
        ++sequence[current_base];
        //error checking to ensure sequence is increasing
        //DEBUG
        //fprintf(outfile,"\nUPD: ");
        //print_sequence();
        /* cur_seq = seq_to_int();
        if(cur_seq <= prev_seq) {
            print_sequence();
            printf("\n\n!!ERRROR: current_base = %d, cur_seq = %d, prev_seq = %d, j = %d\n",current_base,cur_seq,prev_seq,j);
            fclose(outfile);exit(1);
        }
        prev_seq = cur_seq;
        */

        //update base data and test until reach end or failure
        for(pass = 1; current_base < LENGTH && pass == 1; ++current_base)
            pass = test_base(current_base);
        //if testing was successful, print sequence
        --current_base;

        //extra check to be sure not repeating longmers
        if(pass == 1) {
            //record prohibitions on 9mers
            for(cbs = LMAT-1; cbs < LENGTH; ++cbs) {
                c_b = longseq[cbs];
                if(lngVectors[c_b/MXNUM][c_b%MXNUM] != 0) {
                    printf("\n\n!!ERROR: already matching 9mer found: c_b = %ld, current_base = %d, longseq = %ld",
                        c_b, cbs, longseq[cbs]);
                    fclose(outfile);exit(1);
                }
            }
            //record longmer prohibitions
            for(cbs = LMAT-1; cbs < LENGTH; ++cbs) {
                c_b = longseq[cbs];
                lngVectors[c_b/MXNUM][c_b%MXNUM] = 1;
            }
            //increment tag count
            ++tag_count;
            //print tag sequence
            fprintf(outfile,"\n");
            print_sequence();
            /* //calculate mismatches with ref sequence, cseq
            mismatches = 0;
            for(i=0; i<LENGTH; ++i) if(cseq[i] != sequence[i]) ++mismatches;
            ++histogram[mismatches];
            if(mismatches <= 2) {
                fprintf(houtfile,"\n");
                for(i=PRETAG; i<LENGTH; ++i)
                    fprintf(houtfile,"%c",label[sequence[i]]);
            }
            */
            //fprintf(outfile," OK!");
        }
    }

    //if exited on j >= MAX record error and exit
    if(j >= maxtag)
        printf("\n!!ERROR: exceeded allowable passes through primary loop\n");
    //print tag count
    fprintf(outfile,"\n\nTag Count: %d\n",tag_count);
    for(i=0; i<LENGTH; ++i) fprintf(houtfile,"\n%d mismatches: %d",i,histogram[i]);
    fclose(outfile);
    fclose(houtfile);
    return(1);
}

int test_base(int current_base)
{
    int comp; //complement
    long c_b;

    //increment base count (must be accomplished before exiting this function)
    ++basecnt[sequence[current_base]];
    //fprintf(outfile,"\nbase = %d ",current_base);
    //test base count, return upon failure
    if(basecnt[0] + basecnt[3] > ATLIM) {
        //fprintf(outfile," failed AT limit");
        return(0);
    }
    if(basecnt[1] + basecnt[2] > GCLIM) {
        //fprintf(outfile," failed GC limit");
        return(0);
    }
    if(basecnt[0] > AALIM) {
        //fprintf(outfile," failed A limit");
        return(0);
    }
    if(basecnt[1] > CCLIM) {
        //fprintf(outfile," failed C limit");
        return(0);
    }
    if(basecnt[2] > GGLIM) {
        //fprintf(outfile," failed G limit");
        return(0);
    }
    if(basecnt[3] > TTLIM) {
        //fprintf(outfile," failed T limit");
        return(0);
    }

    /*
    //test if base matches patterns, return upon failure
    if(pattern[current_base][sequence[current_base]] != 1) {
        //fprintf(outfile," failed pattern match");
        return(0);
    }
    //if at last base verify checksum
    if(current_base == LENGTH-1)
        if(0 == ((basecnt[0]+basecnt[2])%2)) {
            //fprintf(outfile," failed checksum");
            return(0);
        }
    */

    //compute current 4mer
    if(current_base == 0) fourseq[0] = sequence[current_base];
    else fourseq[current_base] = (4*fourseq[current_base-1])%256 + sequence[current_base];
    //if this is a full 4mer check 4mer and return upon failure
    if(current_base > 2) {
        comp = complement(fourseq[current_base]);
        if(fourmers[comp] <= current_base) {
            //fprintf(outfile,"fourmer failed:curt: %d, comp: %d, spacing: %d",
            // fourseq[current_base],comp,fourmers[comp]);
            return(0);
        }
    }

    //compute current 9mer if full 9mer
    if(current_base == 0) longseq[0] = sequence[current_base];
    else longseq[current_base] = (4*longseq[current_base-1])%LNUM + sequence[current_base];
    //if full longmer check 9mer and return upon failure
    if(current_base > LMAT-2) {
        c_b = longseq[current_base];
        if(lngVectors[c_b/MXNUM][c_b%MXNUM] == 1) {
            //printf("hello");
            //fprintf(outfile,"longmer failed:c_b: %ld, lngVectors[c_b]: %d",c_b,lngVectors[c_b]);
            return(0);
        }
    }

    //record prohibitions on 4mer
    if(current_base > 2 && fourmers[fourseq[current_base]] > spacings[current_base])
        fourmers[fourseq[current_base]] = spacings[current_base];
    //all tests passed!! Add new base:
    //fprintf(outfile,"passed! ");
    //passed all tests
    return(1);
}
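The fourseq and longseq updates in test_base both use the same rolling base-4 index: multiply the previous index by 4, discard the oldest base with a modulus of 4^k, and add the new base. The following is a minimal sketch of that arithmetic in isolation, assuming the coding A=0, C=1, G=2, T=3 implied by label[]. The function names roll_index and kmer_index_at are illustrative and do not appear in the listing.

```c
/* Assumed coding, as in label[] = "ACGT": A=0, C=1, G=2, T=3. */

/* Update a rolling base-4 index: shift the previous index left by one
   base, let the modulus discard the base that falls off the top, then
   append the new base. `modulus` must be 4^k (256 for 4mers,
   1048576 for 10mers). Hypothetical helper name. */
long roll_index(long prev, int base, long modulus)
{
    return (4 * prev) % modulus + base;
}

/* Index of the k-mer ending at position `pos` of seq[], obtained by
   rolling from position 0. The value describes a full k-mer only once
   pos >= k-1. Hypothetical helper name. */
long kmer_index_at(const int *seq, int pos, long modulus)
{
    long idx = 0;
    int i;
    for (i = 0; i <= pos; ++i)
        idx = roll_index(idx, seq[i], modulus);
    return idx;
}
```

For the sequence ACGTA, for example, the 4mer ending at position 3 is ACGT and the 4mer ending at position 4 is CGTA; each index is obtained from the previous one with a single multiply, modulus, and add.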

int remove_base_data(int current_base)
{
    int i;
    int curt; //current fourmer

    if(sequence[current_base] < 0) return(1);
    //calculate complement
    curt = fourseq[current_base];
    //remove current 4mer prohibitions
    if(fourmers[curt] > -1) fourmers[curt] = 1000;
    //reinstate prohibitions for previous copies of this 4mer
    for(i=0; i < current_base; ++i) {
        if(fourseq[i] == curt) {
            if(fourmers[curt] > spacings[i]) fourmers[curt] = spacings[i];
        }
    }
    //adjust base count
    --basecnt[sequence[current_base]];
    return(1);
}

int complement(int fourmer)
{
    int i;
    int comp;

    //assuming fourmers in four bases
    for(i=0, comp=0; i<4; ++i) {
        comp *= 4;
        comp += 3-(fourmer%4); //add complement of base
        fourmer /= 4;
    }
    return(comp);
}
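Because complement() reads the 4mer from its last base to its first while building the result from the most significant digit, the value it returns is the reverse complement of the input. The sketch below illustrates this on readable strings, assuming the coding A=0, C=1, G=2, T=3 from label[]. The helper names revcomp4 and pack4 are illustrative and are not part of the listing.

```c
/* Bases are coded as in label[] = "ACGT": A=0, C=1, G=2, T=3,
   so the complement of base b is 3 - b. */

/* Reverse complement of a 4mer packed in base 4 (first base is the most
   significant digit). Hypothetical helper name. */
int revcomp4(int fourmer)
{
    int i, comp = 0;
    for (i = 0; i < 4; ++i) {
        comp = comp * 4 + (3 - fourmer % 4); /* last base becomes first */
        fourmer /= 4;
    }
    return comp;
}

/* Pack four base letters into the same base-4 code. Any letter other
   than A, C or G is treated as T in this sketch. */
int pack4(const char *s)
{
    int i, code = 0;
    for (i = 0; i < 4; ++i) {
        int b = (s[i] == 'A') ? 0 : (s[i] == 'C') ? 1 : (s[i] == 'G') ? 2 : 3;
        code = code * 4 + b;
    }
    return code;
}
```

Under this coding AAAA, CCCC, GGGG and TTTT pack to 0, 85, 170 and 255, which are the four fourmers entries the listing sets to -1 under the comment "Runs marked".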

double seq_to_int()
{
    int i;
    double seq_int = 0;

    //assumes 4 bases
    for(i=PRETAG; i < LENGTH; ++i) {
        seq_int *= 4;
        seq_int += sequence[i];
        //printf("\n%ld",seq_int);
    }
    return(seq_int);
}

int print_sequence()
{
    int i;

    for(i=PRETAG; i < LENGTH; ++i) fprintf(outfile,"%c",label[sequence[i]]);
    return(1);
}
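The overall search in the listing is a backtracking enumeration: bases are tried in order at each position, and a prefix is abandoned as soon as it breaks a rule. The following is a much reduced sketch of that idea, not the program above. It is recursive rather than iterative, it applies only a per-base count limit and a homopolymer run limit, and it omits the complementary 4mer and shared longmer tests. All names (count_tags, extend, seq_buf, cnt_buf) are illustrative assumptions.

```c
#define NBASES 4
#define MAXLEN 32

/* Hypothetical miniature of the search: depth-first over positions,
   trying bases 0..3 in order and pruning a prefix on the first broken
   rule. */
static int seq_buf[MAXLEN];
static int cnt_buf[NBASES];

static long extend(int pos, int length, int max_per_base, int max_run)
{
    long found = 0;
    int b, run, i;

    if (pos == length) return 1;                     /* complete tag accepted */
    for (b = 0; b < NBASES; ++b) {
        if (cnt_buf[b] + 1 > max_per_base) continue; /* composition limit */
        /* length of the run of b that would end at pos */
        run = 1;
        for (i = pos - 1; i >= 0 && seq_buf[i] == b; --i) ++run;
        if (run > max_run) continue;                 /* homopolymer limit */
        seq_buf[pos] = b;
        ++cnt_buf[b];
        found += extend(pos + 1, length, max_per_base, max_run);
        --cnt_buf[b];                                /* undo on backtrack */
    }
    return found;
}

/* Count the sequences of `length` bases in which no base occurs more than
   max_per_base times and no run of identical bases exceeds max_run.
   Returns -1 if length is outside 0..MAXLEN. */
long count_tags(int length, int max_per_base, int max_run)
{
    int b;
    if (length < 0 || length > MAXLEN) return -1;
    for (b = 0; b < NBASES; ++b) cnt_buf[b] = 0;
    return extend(0, length, max_per_base, max_run);
}
```

With length 4, no effective count limit, and a run limit of 3, the count is 252: all 256 fourmers except the four homopolymers. Pruning at the first failing position is what keeps the search tractable for 20mers, where exhaustively generating every candidate before testing it would be far more expensive.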



Example 4: Tags895.ccp

(Tags895.ccp) written in C is provided below.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <alloc.h>

char out[] = "TAGS.OUT";
char hout[] = "HIST.OUT";
char label[] = "ACGT";

#define BASES 4 //number of nucleotides
//the following numbers include all the nonperiodic bases at the
//end of the primer (in this case one G) in addition to the tag
#define PRETAG 0 //bases of primer preceding tag
#define LENGTH 20 //length of tag plus pretag
#define GCLIM 11 //max total of G's and C's allowed
#define ATLIM 11 //max total of A's and T's allowed
#define AALIM 7 //max total of A's allowed
#define CCLIM 6 //max total of C's allowed
#define GGLIM 6 //max total of G's allowed
#define TTLIM 7 //max total of T's allowed
#define NUMERS 256 //number of fourmers
#define LMAT 10 //length of long matches prohibited
#define LNUM 1048576L //number of longmers
#define MXNUM ( 32768 / sizeof(int))

short numVectors;
int far ** lngVectors;

int test_base(int current_base);
int remove_base_data(int current_base);
int complement(int fourmer);
double seq_to_int(void);
int print_sequence(void);

int pattern[LENGTH][BASES]; //identifies allowed bases each position
int fourmers[NUMERS]; //identifies prohibited 4mers
int sequence[LENGTH]; //base at specified position
int fourseq[LENGTH]; //fourmers ending at specified position
long longseq[LENGTH]; //longmers ending at specified position
int basecnt[BASES]; //number of occurrences of each base to left of
                    //current_position
int spacings[LENGTH]; //position past (or =) to which an occurrence of
                      //a complementary fourmer to the one at this
                      //position is prohibited

FILE *outfile,*houtfile;



main()
{
    int i,j,k,cbs; //counters
    long temp,c_b; //temp storage of longmer
    int cseq[LENGTH] = {2,0,0,0,1,0,0,0,2,0,2,2,2,3,2}; //sample tag
    long xpas; //number of passes thru loop over 4
    long maxtag; //max number of different tags
    int current_base; //current sequence position within backtracking
    int tag_count; //count of acceptable tag sequences
    int pass; //flag for whether tag was acceptable
    int exflag; //flag indicating completion of all possible tags
    double cur_seq; //integer representation of current tag sequence
    double prev_seq; //integer representation of previously tested tag
    int histogram[LENGTH]; //histogram of mismatches to cseq
    int mismatches;

    numVectors = (LNUM/MXNUM);
    lngVectors = (int * *) farcalloc(numVectors, sizeof(int far *));
    if(lngVectors == NULL) {
        printf("\nerror allocating lngVectors!");
        exit(1);
    }
    for(i=0; i < numVectors; ++i) {
        lngVectors[i] = (int *) farcalloc(MXNUM, sizeof(int));
        if(lngVectors[i] == NULL) {
            printf("\nerror allocating vectors: i = %d\n",i);
            exit(1);
        }
    }

    //open output file
    outfile = fopen(out,"wt");
    if(NULL == outfile) {
        printf("\nerror opening output file %s\n",out);
        exit(1);
    }
    //open houtput file
    houtfile = fopen(hout,"wt");
    if(NULL == houtfile) {
        printf("\nerror opening output file %s\n",hout);
        fclose(outfile);exit(1);
    }

    //initialize histogram
    for(i=0; i<LENGTH; ++i) histogram[i] = 0;
    //initialize pattern array
    for(i=0; i<LENGTH; ++i) for(k=0; k<BASES; ++k) pattern[i][k] = 1;
    //initialize 4mer array
    for(i=0; i<NUMERS; ++i) fourmers[i] = 1000;
    //initialize 9mer array
    for(i=0; i<LMAT; ++i) lngVectors[i/MXNUM][i%MXNUM] = 0;
    //Runs marked
    fourmers[0] = fourmers[85] = fourmers[170] = fourmers[255] = -1;
    //initialize sequence
    for(i=0; i<LENGTH; ++i) sequence[i] = 0;
    //initialize spacings to disregard incomplete words at beginning
    for(i=0; i<3; ++i) spacings[i] = 1000;
    //initialize spacings to require 6mer palindrome
    for(i=3; i<LENGTH; ++i) spacings[i] = i + 2;
    //initialize base sequence, current position, base count, tag count etc.
    //start at "largest" sequence we know is wrong
    sequence[0] = -1;
    current_base = 0;
    basecnt[0] = basecnt[1] = basecnt[2] = basecnt[3] = 0;
    tag_count = 0;
    prev_seq = -1;
    //initialize fourseq for fourmers up to current_base
    //don't need to since it is at zero and it will be reset
    //initialize fourmers for fourmers up to current_base
    //don't need to do anything because only one is at zero and
    // it will be removed shortly anyway
    //calculate maximum number of different tags
    for(maxtag = 1,i=0; i<LENGTH-PRETAG-1; ++i) {
        maxtag *= BASES;
    }
    printf("\nmaxtag = %ld\n",maxtag);
    //THIS IS A DEBUG STATEMENT!! REMOVE LATER
    maxtag = (long) 10000000000;
    pass = 0;
    //until all probes are exhausted
    for(j=0, xpas=0,exflag=0; exflag != 1 && xpas<maxtag; ++j) {
        if(j == 4) {
            ++xpas;
            j=0;
        }
        //DEBUG
        //fprintf(outfile,"\nNOW: ");
        //print_sequence();
        //backtrack to last incrementable base
        for(; sequence[current_base] == BASES-1 || (pass == 1 && current_base > 7);
                --current_base) {
            //remove current base data
            remove_base_data(current_base);
            //set base to zero
            sequence[current_base] = 0;
        }
        if(current_base < PRETAG) {
            exflag = 1;
            continue;
        }
        //remove current base data
        remove_base_data(current_base);
        //increment current_base
        ++sequence[current_base];

        //error checking to ensure sequence is increasing
        //DEBUG
        //fprintf(outfile,"\nUPD: ");
        //print_sequence();
        /* cur_seq = seq_to_int();
        if(cur_seq <= prev_seq) {
            print_sequence();
            printf("\n\n!!ERRROR: current_base = %d, cur_seq = %d, prev_seq = %d, j = %d\n",current_base,cur_seq,prev_seq,j);
            fclose(outfile);exit(1);
        }
        prev_seq = cur_seq;
        */

        //update base data and test until reach end or failure
        for(pass = 1; current_base < LENGTH && pass == 1; ++current_base)
            pass = test_base(current_base);
        //if testing was successful, print sequence
        --current_base;

        //extra check to be sure not repeating longmers
        if(pass == 1) {
            //record prohibitions on 9mers
            for(cbs = LMAT-1; cbs < LENGTH; ++cbs) {
                c_b = longseq[cbs];
                if(lngVectors[c_b/MXNUM][c_b%MXNUM] != 0) {
                    printf("\n\n!!ERROR: already matching 9mer found: c_b = %ld, current_base = %d, longseq = %ld",
                        c_b, cbs, longseq[cbs]);
                    fclose(outfile);exit(1);
                }
            }
            //record longmer prohibitions
            for(cbs = LMAT-1; cbs < LENGTH; ++cbs) {
                c_b = longseq[cbs];
                lngVectors[c_b/MXNUM][c_b%MXNUM] = 1;
            }
            //increment tag count
            ++tag_count;
            //print tag sequence
            fprintf(outfile,"\n");
            print_sequence();
            /* //calculate mismatches with ref sequence, cseq
            mismatches = 0;
            for(i=0; i<LENGTH; ++i) if(cseq[i] != sequence[i]) ++mismatches;
            ++histogram[mismatches];
            if(mismatches <= 2) {
                fprintf(houtfile,"\n");
                for(i=PRETAG; i<LENGTH; ++i)
                    fprintf(houtfile,"%c",label[sequence[i]]);
            }
            */
            //fprintf(outfile," OK!");
        }
    }

    //if exited on j >= MAX record error and exit
    if(j >= maxtag)
        printf("\n!!ERROR: exceeded allowable passes through primary loop\n");
    //print tag count
    fprintf(outfile,"\n\nTag Count: %d\n",tag_count);
    for(i=0; i<LENGTH; ++i) fprintf(houtfile,"\n%d mismatches: %d",i,histogram[i]);
    fclose(outfile);
    fclose(houtfile);
    return(1);
}

int test_base(int current_base)
{
    int comp; //complement
    long c_b;

    //increment base count (must be accomplished before exiting this function)
    ++basecnt[sequence[current_base]];
    //fprintf(outfile,"\nbase = %d ",current_base);
    //test base count, return upon failure
    if(basecnt[0] + basecnt[3] > ATLIM) {
        //fprintf(outfile," failed AT limit");
        return(0);
    }
    if(basecnt[1] + basecnt[2] > GCLIM) {
        //fprintf(outfile," failed GC limit");
        return(0);
    }
    if(basecnt[0] > AALIM) {
        //fprintf(outfile," failed A limit");
        return(0);
    }
    if(basecnt[1] > CCLIM) {
        //fprintf(outfile," failed C limit");
        return(0);
    }
    if(basecnt[2] > GGLIM) {
        //fprintf(outfile," failed G limit");
        return(0);
    }
    if(basecnt[3] > TTLIM) {
        //fprintf(outfile," failed T limit");
        return(0);
    }

    /*
    //test if base matches patterns, return upon failure
    if(pattern[current_base][sequence[current_base]] != 1) {
        //fprintf(outfile," failed pattern match");
        return(0);
    }
    //if at last base verify checksum
    if(current_base == LENGTH-1)
        if(0 == ((basecnt[0]+basecnt[2])%2)) {
            //fprintf(outfile," failed checksum");
            return(0);
        }
    */

    //compute current 4mer
    if(current_base == 0) fourseq[0] = sequence[current_base];
    else fourseq[current_base] = (4*fourseq[current_base-1])%256 + sequence[current_base];
    //if this is a full 4mer check 4mer and return upon failure
    if(current_base > 2) {
        comp = complement(fourseq[current_base]);
        if(fourmers[comp] <= current_base) {
            //fprintf(outfile,"fourmer failed:curt: %d, comp: %d, spacing: %d",
            // fourseq[current_base],comp,fourmers[comp]);
            return(0);
        }
    }
    //compute current 9mer if full 9mer
    if(current_base == 0) longseq[0] = sequence[current_base];
    else longseq[current_base] = (4*longseq[current_base-1])%LNUM + sequence[current_base];
    //if full longmer check 9mer and return upon failure
    if(current_base > LMAT-2) {
        c_b = longseq[current_base];
        if(lngVectors[c_b/MXNUM][c_b%MXNUM] == 1) {
            //printf("hello");
            //fprintf(outfile,"longmer failed:c_b: %ld, lngVectors[c_b]: %d",c_b,lngVectors[c_b]);
            return(0);
        }
    }
    //record prohibitions on 4mer
    if(current_base > 2 && fourmers[fourseq[current_base]] > spacings[current_base])
        fourmers[fourseq[current_base]] = spacings[current_base];
    //all tests passed!! Add new base:
    //fprintf(outfile,"passed! ");
    //passed all tests
    return(1);
}

int remove_base_data(int current_base)
{
    int i;
    int curt; //current fourmer

    if(sequence[current_base] < 0) return(1);
    //calculate complement
    curt = fourseq[current_base];
    //remove current 4mer prohibitions
    if(fourmers[curt] > -1) fourmers[curt] = 1000;
    //reinstate prohibitions for previous copies of this 4mer
    for(i = 0; i < current_base; ++i) {
        if(fourseq[i] == curt) {
            if(fourmers[curt] > spacings[i]) fourmers[curt] = spacings[i];
        }
    }
    //adjust base count
    --basecnt[sequence[current_base]];
    return(1);
}

int complement(int fourmer)
{
    int i;
    int comp;

    //assuming fourmers in four bases
    for(i=0, comp=0; i<4; ++i) {
        comp *= 4;
        comp += 3-(fourmer%4); //add complement of base
        fourmer /= 4;
    }
    return(comp);
}

double seq_to_int()
{
    int i;
    double seq_int = 0;

    //assumes 4 bases
    for(i = PRETAG; i < LENGTH; ++i) {
        seq_int *= 4;
        seq_int += sequence[i];
        //printf("\n%ld",seq_int);
    }
    return(seq_int);
}

int print_sequence()
{
    int i;

    for(i=PRETAG; i < LENGTH; ++i) fprintf(outfile,"%c",label[sequence[i]]);
    return(1);
}
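The listing stores one flag per possible 10mer (LNUM = 4^10 entries) in a two-level table addressed as lngVectors[c_b/MXNUM][c_b%MXNUM], so that each farcalloc request stays at 32768 bytes. The far keyword, farcalloc and <alloc.h> are specific to the original compiler. The sketch below shows the same chunked layout with the standard calloc, under the assumption that only the indexing scheme matters. The ChunkTable type and the chunk_table_* functions are hypothetical names, not part of the listing.

```c
#include <stdlib.h>

/* Hypothetical portable stand-in for the far-pointer table: `total` int
   flags stored as rows of `chunk` entries, so that no single allocation
   exceeds chunk * sizeof(int) bytes. */
typedef struct {
    int **rows;
    long nrows;
    long chunk;
} ChunkTable;

/* Allocate a zero-filled table. Returns 1 on success, 0 on failure. */
int chunk_table_init(ChunkTable *t, long total, long chunk)
{
    long i;
    t->chunk = chunk;
    t->nrows = (total + chunk - 1) / chunk;       /* round up */
    t->rows = (int **) calloc((size_t) t->nrows, sizeof(int *));
    if (t->rows == NULL) return 0;
    for (i = 0; i < t->nrows; ++i) {
        t->rows[i] = (int *) calloc((size_t) chunk, sizeof(int));
        if (t->rows[i] == NULL) {                 /* release what was allocated */
            while (i-- > 0) free(t->rows[i]);
            free(t->rows);
            t->rows = NULL;
            return 0;
        }
    }
    return 1;
}

/* Read the flag at a flat index, using the same split as the listing. */
int chunk_table_get(const ChunkTable *t, long idx)
{
    return t->rows[idx / t->chunk][idx % t->chunk];
}

/* Write the flag at a flat index. */
void chunk_table_set(ChunkTable *t, long idx, int value)
{
    t->rows[idx / t->chunk][idx % t->chunk] = value;
}

/* Release every row and the row index. */
void chunk_table_free(ChunkTable *t)
{
    long i;
    if (t->rows == NULL) return;
    for (i = 0; i < t->nrows; ++i) free(t->rows[i]);
    free(t->rows);
    t->rows = NULL;
}
```

Assuming a 2-byte int, MXNUM evaluates to 16384 and the original table has 64 rows; with a 4-byte int the same expression gives 8192 entries per row and 128 rows. On a platform without segment limits a single calloc(LNUM, sizeof(int)) would serve equally well.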

A truncated output list is provided below: 

/**SAMPLE OUTPUT FILE FOR TAGS895.CPP**/

AAACAAACACCCGCGTGGTT
AAACAAAGACCCGCCGGTGT
AAACAAATACCCGCCGTGGG
AAACAACAACCCGCGTGTGG
AAACAACCAACCCGGTGTGG
AAACAACGAACCCGCTGGTG
AAACAACTAACCCGCTGTGG
AAACAAGAACCCGCCGTTGG
AAACAAGCAACCCGGCGTGT
AAACAAGGAACCCGCCTGGT
AAACAAGTAACCCGCCTTTG
AAACAATAACCCGCGCTGGG
AAACAATCAACCCGCTTGGG
AAACAATGAACCCGCGTCGG
AAACAATTAACCCGCGTTCG
AAACACAAACCCGGCTGGTG
AAACACACAACCCGTGGTGG
AAACACAGAACCCGCTTTGG
AAACACATAACCCGGCGGTG
AAACACCAAACCCGTTGTGG
AAACACCCAAACCGTTGTGG
AAACACCGAAACCCTGTGGG
AAACACCTAAACCCTTGTGG
AAACACGAAACCCGGTCGGT
AAACACGCAAACCCGGTGGT
AAACACGGAAACCCTCGGTG
AAACACGTAAACCCGTCGGT
AAACACTAAACCCGTGCGGT
AAACACTCAAACCCTGGTGG
AAACACTGAAACCCGTCTGG
AAACACTTAAACCCGTTCGG
AAACAGAAACCCGCTCGGTG
AAACAGACAACCCGGCTTGG
AAACAGAGAACCCGGCCTTG
AAACAGATAACCCGCTCTTG
AAACAGCAAACCCGTGGCGT
AAACAGCCAAACCGCGTGGT
//many lines removed here // 

Tag Count: 14507 
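
The composition rules in the listing can be checked against any tag in the output list. The sketch below is a stand-alone checker written for illustration, not part of Tags895.ccp. It tests the six limits from the #define lines and rejects any run of four identical bases, on the assumption that marking the four homopolymer fourmers with -1 has that effect. It does not test the complementary 4mer spacing rule or the shared 10mer rule, since both depend on the state of the full search. The name tag_composition_ok is hypothetical.

```c
#include <string.h>

/* Limits copied from the #define lines of the listing. */
#define GCLIM 11
#define ATLIM 11
#define AALIM 7
#define CCLIM 6
#define GGLIM 6
#define TTLIM 7

/* Returns 1 if `tag` (a string over A, C, G, T) respects the composition
   limits and contains no run of four identical bases, else 0.
   Hypothetical checker; the complementary-4mer and shared-10mer rules
   are not tested here. */
int tag_composition_ok(const char *tag)
{
    int a = 0, c = 0, g = 0, t = 0, run = 0;
    size_t i, n = strlen(tag);

    for (i = 0; i < n; ++i) {
        switch (tag[i]) {
        case 'A': ++a; break;
        case 'C': ++c; break;
        case 'G': ++g; break;
        case 'T': ++t; break;
        default: return 0;              /* not a nucleotide letter */
        }
        run = (i > 0 && tag[i] == tag[i - 1]) ? run + 1 : 1;
        if (run >= 4) return 0;         /* AAAA, CCCC, GGGG or TTTT */
    }
    if (a + t > ATLIM || c + g > GCLIM) return 0;
    if (a > AALIM || c > CCLIM || g > GGLIM || t > TTLIM) return 0;
    return 1;
}
```

The first tag in the list, AAACAAACACCCGCGTGGTT, contains 7 A, 6 C, 4 G and 3 T, so it sits exactly at the A and C limits and passes.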



rated by reference. ... described in some detail by way of illustration and example for purposes

appended claims.



Claims 

lected tag nucleic acids with minimal cross hybridization to the nucleic acid.
2. A method of claim 1, wherein the method of selecting tag nucleic acids further comprises:

have more than 8 contiguous nucleotides in common with any previous tag,

selected thermal binding stability, thereby excluding the second nucleic acid from the selected set

acid tags;

binding stability, preferably thermal binding stability being selected by specifying a ratio of G + C
for the tag nucleic acids, and specifying a length for the tag nucleic acids.

in length.

A method of any one of claims 1 to 4, wherein the tags are between 15 and 30 nucleotides in length or between
10 and 100 nucleotides in length, preferably 20 nucleotides in length.

erably all of the tags having the same length and the same GC to AT ratio.

5.

6.

variable nucleotide in

mentary nucleotide for each nucleotide in the variable tag

region of the tag nucleic acid, thereby providing a selected set

correspondence; preferably

the number of ... and

the group consisting of A and T.

between any two of the tag nucleic acids in the set of tag
and preferably wherein the tags also comprise a constant region.

substantially similar Tm.

to hybridize to the group of tag nucleic acids.

chain reaction.

stantially uniform hybridization properties and do not cross hyb
selecting a ratio of G

determining all possible 4 ... subsequence, wherein 4 nucleotide
excluding all probes from the array which contain prohibited ... consisting of
subsequences are prohibited when the ... probes complementary to constant
self-complementary probes, A4 probes, T4 probes, [G,C]4 probes,
region sub-sequences;

optionally wherein the method further comprises 



16. A method of claim 15, the method further comprises selecting control probes for addition to the array.

17. A method of detecting a plurality of nucleic acids in a sample, comprising

(i) providing an array of experimental oligonucleotide probes, which probes do not cross-hybridize under strin-

(ii) detecting hybridization of the nucleic acids to the array of oligonucleotide probes.

18. An array of oligonucleotide probes comprising a plurality of experimental oligonucleotide probe sets attached to a

solid substrate, wherein

each oligonucleotide probe set in the array hybridizes to a different target nucleic acid under

the nucleic acid probes do not cross-hybridize in the array.

19. An array of claim 18 further defined by any one or more of the following features (a) to (f):

(a) each probe set in the array having a constant region, wherein the variable region does not cross hybridize
with the constant region under stringent hybridization conditions;

(b) each probe set in the array differing from every other probe set in the array by the arrangement of at least

two nucleotides in the probes of the probe set;

(c) the ratio of G + C bases in each probe for each experimental probe set being substantially identical;

(d) the array compris

tags.ccp;

(e) the array further comprising a nucleic acid bound to a probe in the array; and

(f) the array further comprising control probes.

20. A method of detecting a target nucleic acid comprising

substrate, wherein

each experimental oligonucleotide probe set in the array hybridizes to a different target nucleic acid under
stringent hybridization conditions; comprises a variable region; optionally the probes of

region under stringent hybridization conditions; and wherein
the nucleic acid probes do not cross-hybridize in the array;

preferably wherein the array also comprises a control probe, and wherein the method further comprises hybridizing
a nucleic acid complementary to the control probe to the array.

comprising tag nucleic acids selected from a set of tag nucleic acids, which set of

in the set of tag nucleic acids cannot be aligned with less than two differences
between any two of the tag nucleic acids in the set of tag nucleic acids.

22. A recombinant cell of claim 21 which is:- 

(a) selected from a library of genetically distinct recombinant cells; 

(b) a eukaryotic cell; 

(c) a prokaryotic cell; or 

(d) a yeast cell. 

23. A kit comprising an array of oligonucleotides, wherein

the array of oligonucleotide probes comprises a plurality of experimental oligonucleotide probe sets attached
each oligonucleotide probe set in the array hybridizes to a different target nucleic acid under
stringent hybridization conditions; comprises a variable region, optionally each oligo-

bridize with the constant region under stringent hybridization conditions; and
the nucleic acid probes do not cross-hybridize in the array.

24. A kit of claim 23, wherein the kit further comprises:-

(a) a plurality of tag nuc

(b) control oligonucleotide probes; and/or

(c) PCR reagents, a container and instructions.

25. The use of nucleic acid sequences as tags for components of a library such as to permit component identification
by tag hybridization to a probe array.

20mer Array
Hybridized with control oligonucleotide

PCR Deletion Strategy
Figure 3

Transformation Results
Figure

Tag Analysis
Figure 5